[Biopython] NcbiblastnCommandline vs subprocess blast
Peter Cock
p.j.a.cock at googlemail.com
Wed Aug 21 09:39:04 UTC 2013
On Wed, Aug 21, 2013 at 9:41 AM, Tanya Golubchik
<golubchi at stats.ox.ac.uk> wrote:
> Hello,
>
> The following refers to Biopython 1.61. Does anyone know if there are any
> hidden or hard-coded defaults for any parameters in NcbiblastnCommandline?
> Or any known bugs that could cause hits not to be reported?
>
> I've encountered an extremely frustrating issue that I've never seen before.
> The upshot is that the blastn result obtained through NcbiblastnCommandline
> occasionally reports "no hits" in a way that seems dependent on the query
> file. This strange output is different from that obtained via a subprocess
> call (or outside python entirely) -- both of which recover all hits
> consistently. I am using *exactly* the same parameters, inputs, and exactly
> the same path to the blastn executable.
>
> What actually happens is that for my large fasta file of 3000-odd nucleotide
> queries, several give no hits with the NcbiblastnCommandline call (these are
> not at the end of the query file, just throughout the file). On the other
> hand, cutting the file down to about 2000 queries, but not changing it in
> any other way, does give hits for these missing queries. Note that *nothing*
> else is changed; the parameters and call to blastn remain identical. This
> only happens for some, not all, of the blast databases I'm searching, making
> it look like there are variable deletions between the samples.
>
> I can provide (large) test data files if anyone thinks they can help. I have
> the query file that produces the wrong 'patchy' output, another that
> produces the correct output, and a sample blast database for which this
> happens.
>
> The actual calls are (paths substituted):
>
> NcbiblastnCommandline(cmd='/path/to/blastn', outfmt=5,
> query='untrimmed.fasta', db='/path/to/db/C00006635', gapopen=5, gapextend=2,
> culling_limit=2)
>
> The above gives 'no hits' for about 25 queries out of the 3000+ in the file.
Are you using it like this, which also uses subprocess,
cline = NcbiblastnCommandline(...)
stdout, stderr = cline()
> stdout, stderr = sp.Popen("/path/to/blastn -db /path/to/db/C00006635 -query
> untrimmed.fasta -outfmt 5 -word_size 17 -gapopen 5 -gapextend 2
> -culling_limit 2".split(), stdout=sp.PIPE, stderr=sp.PIPE).communicate()
>
> The above call to subprocess returns all hits, correctly.
Note there is a subtle difference in the order of the command line
which could (depending on how the command line parsing is done)
reveal a bug in BLAST:
>>> from Bio.Blast.Applications import NcbiblastnCommandline
>>> cline = NcbiblastnCommandline(cmd='/path/to/blastn', outfmt=5, query='untrimmed.fasta', db='/path/to/db/C00006635', gapopen=5, gapextend=2, culling_limit=2)
>>> str(cline)
'/path/to/blastn -outfmt 5 -query untrimmed.fasta -db
/path/to/db/C00006635 -gapopen 5 -gapextend 2 -culling_limit 2'
Just to rule this out, retry running this string instead.
However, the more likely explanation is you didn't set the wordsize,
unless that is a typo in this email?
Regards,
Peter
More information about the Biopython
mailing list