[Biopython] Parsing large blast files
Peter Cock
p.j.a.cock at googlemail.com
Mon Apr 27 10:54:09 UTC 2009
On Mon, Apr 27, 2009 at 11:34 AM, Stefanie Lück
<lueck at ipk-gatersleben.de> wrote:
> Hi!
>
> I want to blast many sequences against one DB and parse the outputs.
> At the moment, I do it this way:
>
> ...
>
> This works but I think it's quite slow. I tried also the NCBIStandalone.Iterator()
> code from the tutorial but I got the error message "Invalid header".
> Would NCBIStandalone.Iterator() be faster?
NCBIStandalone.Iterator() is the old semi-obsolete plain text parser - it won't
parse the XML output, hence the "Invalid header" error. Maybe the tutorial
(or the error message) could be clearer.
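As a rough sketch of the difference (not from the original thread): BLAST's XML output starts with an `<?xml` declaration, so checking the first non-blank line tells you which parser a file needs. The helper names `is_blast_xml` and `iter_xml_records` are made up for illustration. `NCBIXML.parse` is Biopython's XML parser, imported lazily here so that the format check itself needs only the standard library.

```python
def is_blast_xml(path):
    """Return True if the first non-blank line looks like an XML declaration."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                return line.lstrip().startswith("<?xml")
    return False  # empty file


def iter_xml_records(path):
    """Yield one Biopython BLAST record per query from an XML output file."""
    if not is_blast_xml(path):
        raise ValueError("%s is not BLAST XML; NCBIXML cannot parse it" % path)
    from Bio.Blast import NCBIXML  # needs Biopython installed
    with open(path) as handle:
        for record in NCBIXML.parse(handle):
            yield record
```

Plain text output would fail the check and raise a readable error instead of "Invalid header".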
>
> Or, is there a way not to save an XML file or to save only the best hits
> (100 % match)?
>
You could set a stricter expectation (E-value) threshold. I don't think
there is an identity threshold, which would be ideal for your example.

If you only want the single BEST hit for a query, set the number of
alignments and/or descriptions to show to just one. These settings do
different things in the plain text output; for XML output you may only
need to limit the number of alignments. This should give a much smaller
file, which will be faster to parse.
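A sketch of both ideas, under assumptions that are not in the original message: the legacy `blastall` flags `-e` (expectation cutoff), `-v` (descriptions), `-b` (alignments) and `-m 7` (XML output), with `blastn` as a placeholder program, plus a filter applied after parsing to stand in for the missing identity threshold. The function names are hypothetical, and "perfect" here only means that every column of the reported HSP is identical, not that the whole query is covered.

```python
def blastall_command(query, database, out_xml, evalue=1e-10, max_hits=1):
    """Build (but do not run) a legacy blastall command with tight limits."""
    return [
        "blastall",
        "-p", "blastn",        # assumed program; use whatever matches your data
        "-d", database,
        "-i", query,
        "-o", out_xml,
        "-m", "7",             # XML output
        "-e", str(evalue),     # expectation threshold
        "-v", str(max_hits),   # number of one-line descriptions
        "-b", str(max_hits),   # number of alignments
    ]


def perfect_hits(blast_records):
    """Yield (query, hit title) for hits having an HSP with no mismatches or gaps."""
    for record in blast_records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.identities == hsp.align_length:
                    yield record.query, alignment.title
                    break  # one perfect HSP per hit is enough
```

The records passed to `perfect_hits` would come from Biopython's `NCBIXML.parse` on the XML file that the command writes.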
Finally, and perhaps most importantly - don't do an individual BLAST
query for each record. Instead, prepare a FASTA file of ALL your
queries, and use that as the input to BLAST. This way there is only
one command line call, and the BLAST database is only loaded into
memory once.
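For illustration, a sketch of the batching approach, assuming the queries are available as (identifier, sequence) pairs and that legacy `blastall` with `blastn` is the right program for your data. Both are assumptions, and `write_fasta` and `run_batch` are made-up names. With XML output, Biopython's `NCBIXML.parse` should then yield one record per query in the FASTA file.

```python
import subprocess


def write_fasta(queries, path, width=60):
    """Write (identifier, sequence) pairs to one FASTA file; return the count."""
    count = 0
    with open(path, "w") as handle:
        for name, seq in queries:
            handle.write(">%s\n" % name)
            # Wrap the sequence at a fixed line width.
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
            count += 1
    return count


def run_batch(queries, database, fasta_path="queries.fasta", out_xml="results.xml"):
    """Run BLAST once for all queries, so the database is loaded only once."""
    write_fasta(queries, fasta_path)
    subprocess.check_call([
        "blastall", "-p", "blastn",   # assumed program
        "-d", database, "-i", fasta_path,
        "-m", "7", "-o", out_xml,     # XML output
    ])
    return out_xml
```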
Peter