[Biopython] Parsing large blast files

Stefanie Lück lueck at ipk-gatersleben.de
Tue Apr 28 06:05:30 EDT 2009


Hi Peter!

I'll play around a bit with the thresholds, and also with the short-query parameters 
(http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node74.html), 
which I actually need (nt = 21 bp). Of course, e = 1000 makes it even 
slower.

>Only a 50% time speed up? i.e. It took half the time?  Not bad,
>although I expected more.  It will probably depend on the number of
>queries, their sizes, and the database - probably the speed up would
>be more for a larger database like NR.

I blast ~3000 queries against the TIGR barley v9 DB (50500 subjects). It 
takes about 35 seconds on XP with an E8400 (3 GHz) and 4 GB RAM. I hope this is 
normal...

Kind regards
Stefanie


----- Original Message ----- 
From: "Peter Cock" <p.j.a.cock at googlemail.com>
To: "Stefanie Lück" <lueck at ipk-gatersleben.de>
Cc: <biopython at lists.open-bio.org>
Sent: Tuesday, April 28, 2009 10:33 AM
Subject: Re: [Biopython] Parsing large blast files


On Tue, Apr 28, 2009 at 9:23 AM, Stefanie Lück <lueck at ipk-gatersleben.de> 
wrote:
> Thanks Peter!
>>
>> You could set the expectation threshold (I don't think there is an
>> identity threshold which would be ideal for your example).
>
> I can't say what the expectation threshold will be. This won't work.

You still might be able to reduce it from the default of 10.0, maybe even
just to 1.0, without losing the very high identity matches you want.
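
Since BLAST itself has no identity threshold, one option is to filter by
identity after parsing. The following is only a minimal sketch, assuming XML
output and the Bio.Blast.NCBIXML parser; the function names and the
"full-length, no mismatches" rule are illustrative assumptions, not something
taken from this thread.

```python
def is_perfect_hit(identities, align_length, query_length):
    """True if the HSP spans the whole query with no mismatches or gaps.

    Assumed rule: identical positions == alignment length == query length.
    """
    return identities == align_length == query_length


def perfect_hits(xml_path):
    """Yield (query, subject title) for each full-length 100 % identity HSP.

    Hypothetical helper; expects BLAST XML output and needs Biopython.
    """
    from Bio.Blast import NCBIXML  # imported here so the rule above works alone

    with open(xml_path) as handle:
        for record in NCBIXML.parse(handle):
            for alignment in record.alignments:
                for hsp in alignment.hsps:
                    if is_perfect_hit(hsp.identities, hsp.align_length,
                                      record.query_letters):
                        yield record.query, alignment.title
```

With a rule like this, a looser expectation threshold only costs parsing
time, because anything short of a full-length perfect match is dropped
afterwards.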

>> If you only want the single BEST hit for a query, set the number of
>> alignments and/or descriptions to show to just one (these do different
>> things in the plain text output - maybe for XML output you only need
>> to limit the number of alignments). This should give a much smaller
>> file, which will be fast to parse.
>
> This is too risky. There might be several 100 % hits which I need.

If you expect and want several hits per query, then my suggestion is
inappropriate.

>> Finally, and perhaps most importantly - don't do an individual BLAST
>> query for each record. Instead, prepare a FASTA file of ALL your
>> queries, and use that as the input to BLAST. This way there is only
>> one command line call, and the BLAST database is only loaded into
>> memory once.
>
> Cool, I didn't know that this will work! Great, that's very nice! 50 %
> time speed up!

Only a 50% time speed up? i.e. It took half the time?  Not bad,
although I expected more.  It will probably depend on the number of
queries, their sizes, and the database - probably the speed up would
be more for a larger database like NR.
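
For reference, the single-call approach could look roughly like the sketch
below. It is a minimal illustration under stated assumptions: the helper
names, file names and database name are hypothetical, blastn is assumed as
the program, and the options used are the legacy blastall flags (-p, -d, -i,
-o, -e, and -m 7 for XML output).

```python
def format_fasta(queries):
    """Return one FASTA string for an iterable of (name, sequence) pairs."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in queries)


def write_fasta(queries, path):
    """Write all queries into a single FASTA file."""
    with open(path, "w") as handle:
        handle.write(format_fasta(queries))


def blastall_command(query_file, database, out_file, evalue=10.0):
    """Build ONE blastall call covering every query in query_file."""
    return [
        "blastall",
        "-p", "blastn",      # assumed program (nucleotide vs nucleotide)
        "-d", database,      # database is loaded once for all queries
        "-i", query_file,    # multi-record FASTA input
        "-o", out_file,
        "-m", "7",           # XML output
        "-e", str(evalue),   # expectation threshold
    ]


# Example use (hypothetical names; requires blastall on the PATH):
#   import subprocess
#   write_fasta(my_queries, "all_queries.fasta")
#   subprocess.call(blastall_command("all_queries.fasta", "barley_v9",
#                                    "results.xml", evalue=1.0))
```

The resulting XML file then holds one record per query, so it can be parsed
in a single pass.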

Peter


