[Bioperl-l] Blast Output and frac_aligned_query

Tue Jul 20 06:44:03 EDT 2004

On Jul 20, 2004, at 4:50 AM, James Wasmuth wrote:

> Thanks Aaron, time is a slight issue as I'm carrying out several 
> million comparisons but I'll concede accuracy is the more important 
> feature...

Right, another common fallacy with bl2seq: several million pairwise 
comparisons always sounds like a lot, until one realizes that a single 
search of the "nr" database is of the same magnitude.  Sure, BLAST will 
finish this amount of work in less than 10 minutes, but do we really 
mind waiting an hour or two to get better alignments?  You're going to 
spend far more time on the analysis, so why not make it easier on 
yourself in the long run?  Then you don't have to worry about niggling 
questions like "Hmm, I wonder if BLAST actually aligned all of the 
homologous regions, or only those disjoint, slowly-evolving fragments 
it could easily find"; this is particularly relevant when using BLAST 
to align DNA to either DNA or protein.

As an aside, this is exactly the kind of batch processing targeted by 
various task distribution clients (e.g. "disperse").  With a modicum of 
processing power (say 4-8 modern CPUs), we routinely batch process 
millions of pairwise alignments with SSEARCH, PRSS, and/or LALIGN.
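
To make the batch-processing idea concrete, here is a minimal Python sketch of farming pairwise jobs out to a small worker pool.  It is not how "disperse" or any particular task distribution client works; align_pair() is a hypothetical stand-in that just counts matching positions, where a real job would launch an external aligner on the two sequences.  The sequences and names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def align_pair(pair):
    """Hypothetical stand-in for one pairwise alignment job.

    A real job would run an external aligner on the two sequences;
    this toy version only counts identical characters at the same
    positions so that the sketch runs on its own.
    """
    (id_a, seq_a), (id_b, seq_b) = pair
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return id_a, id_b, matches

def run_batch(pairs, workers=4):
    """Run every job on a pool of workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_pair, pairs))

# Made-up example sequences (assumption, not from the original post).
seqs = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TTGTAC"}
items = list(seqs.items())
# Each unordered pair once (row index > column index).
pairs = [(items[i], items[j]) for i in range(len(items)) for j in range(i)]

results = run_batch(pairs, workers=4)
for id_a, id_b, score in results:
    print(id_a, id_b, score)
```

Threads are enough here on the assumption that each real job spends its time in an external process; for work done in pure Python a process pool would be the better fit.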

Additionally, for the common case of an "all-vs-all" matrix of pairwise 
alignments, SSEARCH has the "-I" option, which evaluates only the 
lower triangle of the matrix.  This provides the A vs. B alignment but 
not B vs. A; the two are guaranteed to have identical alignments and 
scores, but probably different E() values and bit scores (but you were 
already using PRSS or PRFX to confirm pairwise significances, right?).
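
The saving from evaluating only the lower triangle can be sketched in a few lines of Python.  This only illustrates the pair enumeration and the counts; it says nothing about how SSEARCH implements the option, and the function names and sequence IDs are made up for the example.

```python
from itertools import combinations

def lower_triangle_pairs(seq_ids):
    """Yield each unordered pair exactly once, as (later, earlier).

    This corresponds to the cells below the diagonal of the
    all-vs-all matrix (row index > column index).
    """
    for earlier, later in combinations(seq_ids, 2):
        yield later, earlier

def comparison_counts(n):
    """Return (all ordered pairs without self-hits, lower triangle only)."""
    return n * (n - 1), n * (n - 1) // 2

ids = ["A", "B", "C", "D"]  # made-up sequence IDs
pairs = list(lower_triangle_pairs(ids))
full, lower = comparison_counts(len(ids))
print(pairs)
print(full, lower)
```

For n sequences this is n(n-1)/2 alignments instead of n(n-1), i.e. half the work.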

And to add just a bit more icing to the cake, SSEARCH runs efficiently 
under both PVM and MPI parallel environments; so the 10-100 fold 
"slow-down" associated with SW can be nicely ameliorated with 8 to 32 
cluster nodes (unless your database is very big, more than 32 nodes 
will typically not be any more efficient).  Those with multi-CPU 
machines can also build a threaded SSEARCH for single-workstation 
use.
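
As back-of-the-envelope arithmetic for the numbers above, the Python sketch below divides the SW slow-down by the node count.  It assumes ideal linear scaling, which is an assumption made purely for illustration (and, as noted, scaling typically stops improving past about 32 nodes); the efficiency parameter is a hypothetical knob for anything less than ideal.

```python
def relative_wall_time(slowdown, nodes, efficiency=1.0):
    """Wall-clock time relative to a single BLAST run.

    slowdown   -- how many times slower SW is than BLAST on one CPU
    nodes      -- number of cluster nodes sharing the work
    efficiency -- fraction of ideal linear speed-up achieved (assumed)
    """
    if nodes < 1 or not 0 < efficiency <= 1:
        raise ValueError("need nodes >= 1 and 0 < efficiency <= 1")
    return slowdown / (nodes * efficiency)

for slowdown in (10, 100):
    for nodes in (8, 32):
        ratio = relative_wall_time(slowdown, nodes)
        print(f"{slowdown}x slow-down on {nodes} nodes: {ratio:g}x BLAST time")
```

Under these assumptions a 10-fold slow-down on 8 nodes costs 1.25 times the BLAST wall-clock time, and a 100-fold slow-down on 32 nodes costs about 3 times.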

This public service message brought to you by the fine makers of: 
FASTA, the original search algorithm

Add grains of salt to taste.  And thanks, James, for being my scapegoat 
of the day.

-Aaron

--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  amackey at pcbi.upenn.edu
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA  19104-6017     fax:    215-746-6697