[Biopython-dev] Fwd: Fast instance search of motif in a sequence

Tue Feb 12 23:18:17 UTC 2013

Hi all,

I work on comparative genomics and frequently use the Motif module of
Biopython. One of the operations I perform most often is to build a
motif from a set of sites and search a sequence for instances that are
similar to the motif [Bio.Motif._Motif.search_instances()].

The problem is that the sequence being searched is huge. It is usually
the genome sequence itself, together with its reverse complement. For
example, scanning the E. coli genome plus its reverse complement with a
motif of length ~20 takes almost a minute on my machine.
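
To illustrate the kind of scan described above, here is a minimal
pure-Python sketch. It is not the Bio.Motif code and not the yassi code;
the scoring scheme (a log-odds matrix with a pseudocount and a uniform
background) and all function names (build_pssm, scan,
search_both_strands) are assumptions made for the example. It also
assumes the sequence contains only A, C, G and T. The per-window inner
loop is what makes a genome-sized scan slow in plain Python.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds scoring matrix from equal-length aligned sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(pssm, seq, threshold):
    """Yield (position, score) for each window scoring >= threshold."""
    width = len(pssm)
    for pos in range(len(seq) - width + 1):
        # One dictionary lookup per motif column, for every window.
        score = sum(pssm[i][seq[pos + i]] for i in range(width))
        if score >= threshold:
            yield pos, score


def search_both_strands(sites, seq, threshold):
    """Scan seq and its reverse complement.

    Returns (start, strand, score) tuples, with start always given in
    forward-strand coordinates.
    """
    pssm = build_pssm(sites)
    width = len(pssm)
    hits = [(pos, "+", score) for pos, score in scan(pssm, seq, threshold)]
    rc = reverse_complement(seq)
    hits += [(len(seq) - width - pos, "-", score)
             for pos, score in scan(pssm, rc, threshold)]
    return hits


if __name__ == "__main__":
    example_sites = ["AAGC", "AAGC", "AAGT"]
    for hit in search_both_strands(example_sites, "CCAAGCCC", threshold=4.0):
        print(hit)
```

The work grows with (sequence length) x (motif length) for each strand,
so a motif of length ~20 over a genome of a few million bases means on
the order of 10^8 lookups, which is cheap in C but slow in an
interpreted loop.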

To make it faster, I implemented a C version with a Python interface, so
that it can be called from Python. It is considerably faster: the same
scan takes about 2.5 seconds.

The current implementation can be found at:

https://github.com/sefakilic/yassi

If anyone is interested and it is appropriate, I would like to modify the
current implementation and integrate it into Biopython.

Thanks!

Sefa Kilic