[Biojava-l] ssaha

Wed Apr 23 12:40:37 EDT 2003

On Wed, Apr 23, 2003 at 04:09:43PM +1200, Schreiber, Mark wrote:
> Hi -
>  
> I have been using the ssaha package in biojava and it's really great, good work guys!!
>  
> I have one minor question: am I correct in assuming that hits are reported only if a segment from the query sequence exactly matches a word in the seqstore? Does ssaha allow for partial matches?

Yes, that's right: hits are reported only for exact word matches.
The point of ssaha and similar lookup-table algorithms is to
give you the fastest possible detection of exact matches.
You can get a more `general purpose' search algorithm by taking
each ssaha word hit and then using dynamic programming to extend
the alignment outwards (this is, to a first approximation, the
blast algorithm).  Obviously if you just want straightforward
seed-and-extend searching, you're probably better off using
blast, but if you want to develop something a bit more specialized,
biojava with the ssaha and dp packages is a nice platform to start
from.
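
To make the seed-and-extend idea concrete, here is a rough sketch in
plain Python. It does not use the biojava ssaha or dp APIs; the names
build_word_index, local_score and seed_and_extend are hypothetical,
and the scoring values (match 1, mismatch -1, gap -2) and the flank
size are arbitrary assumptions. Each exact word hit selects a window
of the subject around the hit's diagonal, and a small Smith-Waterman
pass scores the query against that window (score only, no traceback).

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every k-base word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def seed_and_extend(query, subject, index, k, flank=10):
    """Score the query against a subject window around each exact word hit.

    Returns a dict mapping (window_start, window_end) to the local score.
    """
    results = {}
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], ()):
            # The hit implies the query starts near subject position s - q.
            lo = max(0, s - q - flank)
            hi = min(len(subject), s - q + len(query) + flank)
            if (lo, hi) not in results:
                results[(lo, hi)] = local_score(query, subject[lo:hi])
    return results
```

Hits that fall on the same diagonal share a window, so each window is
scored once. A real implementation would also want a traceback and
some filtering of low-scoring windows.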

It's possible to improve the sensitivity somewhat by building
a hashtable from non-contiguous patterns of nucleotides, rather
than simple `words'.  For example, a while back I developed a
rather specialized application which found all matches for a
given 25mer with up to 2 mismatched bases.  You can do this by
hashing the genome on 19-base windows, but recording only 12 of
the bases in each window (the optimal pattern for this particular
application is 1011100101110010111, where 1 marks a recorded base).
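
As an illustration of the non-contiguous pattern idea, here is a
minimal sketch in plain Python (not the biojava ssaha API, and not the
code offered below). The pattern is the one quoted above; the function
names and the final Hamming-distance check are assumptions made for
the example. A 25mer contains 7 windows of 19 bases, and each window is
looked up using only the 12 recorded bases, so mismatches that fall on
unrecorded positions do not break the lookup.

```python
from collections import defaultdict

# Pattern from the message: 19 bases wide, 12 of them recorded ('1').
PATTERN = "1011100101110010111"
SAMPLED = [i for i, c in enumerate(PATTERN) if c == "1"]


def spaced_key(seq, start):
    """Return the recorded bases of the 19-base window starting at `start`."""
    return "".join(seq[start + i] for i in SAMPLED)


def build_index(genome):
    """Map each spaced key to the genome positions where its window starts."""
    index = defaultdict(list)
    for start in range(len(genome) - len(PATTERN) + 1):
        index[spaced_key(genome, start)].append(start)
    return index


def search(query, genome, index, max_mismatches=2):
    """Return genome positions matching `query` with few enough substitutions."""
    hits = set()
    for offset in range(len(query) - len(PATTERN) + 1):
        for start in index.get(spaced_key(query, offset), ()):
            pos = start - offset  # implied start of the whole query
            if pos < 0 or pos + len(query) > len(genome):
                continue
            target = genome[pos:pos + len(query)]
            # Verify the candidate by counting mismatched bases directly.
            if sum(a != b for a, b in zip(query, target)) <= max_mismatches:
                hits.add(pos)
    return sorted(hits)
```

Usage would look like index = build_index(genome) followed by
search(my_25mer, genome, index). The sketch handles substitutions
only, not insertions or deletions.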

For more general cases, there was an interesting paper on choice
of seed patterns at this year's RECOMB.

If you're interested (and don't mind a bit of tidying up work)
I can send you code for this.

    Thomas.