[Bioperl-l] An update for my DNA Smith-Waterman code

Sun Jan 26 10:15:39 EST 2003

On Sat, 25 Jan 2003, Yee Man wrote:

> I am aware of this implementation and I have the code. However, I can't
> seem to compile the package on my SPARC.

What errors are you getting?  We don't typically develop on SPARC (Sun,
DEC, Linux, Win*, and Mac OS X are the primary environments).

> Also I would like to know if there are any paper that descries his
> algorithm before I dive into the code.

Look for papers by Pearson and Miller.

> Do you have a short C program that reads two FASTAs and calls the
> functions in dropgsw.c?

No; many have asked us to put this code into a linkable library, but since
this doesn't further our own research interests it's difficult to invest
the necessary time.  We'd be happy to help anyone who wished to pursue
this, though ... the "code-paths" you want to follow in dropgsw.c are
"do_work" (calculate the score only), and "do_walign" (calculate the
entire alignment).

> > I can also send you the patches to make it altivec-enabled (and
> > approx. 6-8 times faster).

> Is this something that multi-threads the program to run on multiple CPUs?

No, it allows the algorithm to run using vector datatypes so that
calculating the DP matrix can be done in parallel.  But the FASTA programs
all have threaded versions, as well as PVM and MPI runmodes.

> I tried it but it seems to me ssearch34 only display the score only. How
> can I get the alignment? How can I set gap penalty? I read the
> accompanied fasta3x.doc but I still can't figure it out...

Make sure you are using a recent (version 3.4) distribution.

% ssearch34 -n -q -H -d 0 -b 1 t1.fa t2.fa

( the -f and -g parameters change the gap penalties for gap open and gap
extend, respectively; if you have an older version of the package, the gap
open penalty *includes* the first gap extension -- more recent versions do
not include the first gap extension in the gap open penalty )

This calculates scores only, no alignments; On my 1GHz G4, running
altivec'ed SW, this takes 1.6 seconds to calculate the SW score (32767)
(non altivec'ed on this pair with osearch34 [non-SWAT optimized] takes 5.6
seconds)

If I change the -d 0 to -d 1, ssearch34 also calculates the alignment
using the linear space, divide and conquer method.  This part is *not*
altivec accelerated (yet), and does, as you describe, make multiple passes
through the matrix to get start/end coordinates.  That takes about 9
additional seconds to generate the alignment.

On this same machine, your align.pl on t1.fa and t2.fa takes 33 seconds.
Running on seqs of length 1 shows that the various perl/C binding overhead
takes about 0.5 seconds, so I'm guessing that's not the source of the
difference.

-Aaron

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey at virginia.edu