[Biopython] Parsing large blast files

Tue Apr 28 08:33:52 UTC 2009

On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück <lueck at ipk-gatersleben.de> wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what the expectation threshold will be. This won't work.

You still might be able to reduce it from the default of 10.0, maybe even
just to 1.0, without losing the very high identity matches you want.
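
As a rough sketch of what lowering the threshold could look like, assuming
the legacy NCBI blastall tool (where -e sets the expectation cutoff and
-m 7 requests XML output). The helper name build_blast_command, the choice
of blastn, and the file and database names are all made up for illustration:

```python
def build_blast_command(query_fasta, database, out_xml, expect=1.0):
    """Return an argument list for legacy NCBI blastall (hypothetical helper)."""
    return [
        "blastall",
        "-p", "blastn",      # assumption: nucleotide queries vs nucleotide database
        "-d", database,      # BLAST database name
        "-i", query_fasta,   # input FASTA file of queries
        "-e", str(expect),   # expectation threshold (BLAST default is 10.0)
        "-m", "7",           # XML output
        "-o", out_xml,       # output file
    ]


# Placeholder file and database names, for illustration only.
cmd = build_blast_command("queries.fasta", "my_db", "results.xml", expect=1.0)
print(" ".join(cmd))
```

The list could then be handed to something like subprocess.call(cmd); the
command is only built here, not run.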

>> If you only want the single BEST hit for a query, set the number of
>> alignments and/or descriptions to show to just one (these do different
>> things in the plain text output - maybe for XML output you only need
>> to limit the number of alignments).  This should give a much smaller
>> file, which will be fast to parse.
>
> This is too risky. There might be several 100% hits which I need.

If you expect and want several hits per query, then my suggestion is
inappropriate.
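
Since BLAST has no identity cutoff, one option is to keep all the hits and
filter on identity afterwards. The following is only a sketch: it assumes
tab-separated BLAST output (-m 8 in legacy blastall) with the standard
twelve columns, where the third column is the percentage identity, and it
uses plain Python rather than a Biopython parser. The function name
perfect_hits and the example rows are invented:

```python
from collections import defaultdict


def perfect_hits(lines, min_identity=100.0):
    """Group subject IDs by query ID, keeping hits at or above min_identity.

    Hypothetical helper; expects tab-separated BLAST output lines.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        identity = float(fields[2])  # column 3: percentage identity
        if identity >= min_identity:
            hits[query_id].append(subject_id)
    return dict(hits)


# Made-up rows, for illustration only.
example = [
    "q1\ts1\t100.00\t50\t0\t0\t1\t50\t1\t50\t1e-20\t99.0",
    "q1\ts2\t98.50\t50\t1\t0\t1\t50\t1\t50\t1e-18\t92.0",
    "q1\ts3\t100.00\t50\t0\t0\t1\t50\t7\t56\t1e-20\t99.0",
    "q2\ts9\t100.00\t30\t0\t0\t1\t30\t1\t30\t1e-10\t60.0",
]
result = perfect_hits(example)
print(result)
```

Note that the identity is measured over the aligned region only, so a
"100%" hit may still cover just part of the query; comparing the alignment
length (column 4) with the query length would catch that.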

>> Finally, and perhaps most importantly - don't do an individual BLAST
>> query for each record.  Instead, prepare a FASTA file of ALL your
>> queries, and use that as the input to BLAST.  This way there is only
>> one command line call, and the BLAST database is only loaded into
>> memory once.
>
> Cool, I didn't know that this will work! Great, that's very nice! 50 % time
> speed up!

Only a 50% speed up, i.e. it took half the time?  Not bad,
although I expected more.  It will probably depend on the number of
queries, their sizes, and the database; the speed up would probably
be larger for a big database like NR.
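
For anyone else trying the single-file approach, here is a minimal sketch of
writing all the queries into one FASTA file, so that BLAST only needs to be
called once. It uses plain Python; the helper name write_fasta and the
example sequences are made up:

```python
import io


def write_fasta(records, handle, width=60):
    """Write (name, sequence) pairs to handle in FASTA format.

    Hypothetical helper; returns the number of records written.
    """
    count = 0
    for name, seq in records:
        handle.write(">%s\n" % name)
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
        count += 1
    return count


# Made-up queries; an in-memory buffer stands in for a real file here.
queries = [("query1", "ACGTACGTAC"), ("query2", "GGGCCCTTTAAA")]
buffer = io.StringIO()
n = write_fasta(queries, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

In practice the handle would be a real file opened for writing, and if the
queries are already SeqRecord objects then Bio.SeqIO.write(records, handle,
"fasta") should do the same job.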

Peter