[Biopython] MOODS: fast search for position weight matrix matches in DNA sequences.

Thu Sep 24 11:46:42 UTC 2009

On Thu, Sep 24, 2009 at 11:59 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I'm forwarding an interesting post from Dave to the BioPerl mailing list, which
> should also be of interest here...
>
> Just this week, for example, there is this, which could go both on a static
> page and in the newsfeed:
> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp554v1
>
> MOODS: fast search for position weight matrix matches in DNA sequences.
>
> Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E.
> Department of Computer Science and Helsinki Institute for Information
> Technology,
> University of Helsinki, Helsinki, Finland.
>
> SUMMARY: MOODS (MOtif Occurrence Detection Suite) is a software package for
> matching position weight matrices against DNA sequences. MOODS implements
> state-of-the-art on-line matching algorithms, achieving considerably faster
> scanning speed than with a simple brute-force search. MOODS is written in C++,
> with bindings for the popular BioPerl and Biopython toolkits. It can easily be
> adapted for different purposes and integrated into existing workflows. It can
> also be used as a C++ library. AVAILABILITY: The package with documentation and
> examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. The
> source code is also available under the terms of a GNU General Public License
> (GPL). CONTACT: janne.h.korhonen at helsinki.fi.
Hi all,

I've seen this paper. It is directly related to the Bio.Motif code.
They did a pretty good job of implementing an extremely efficient tool
for finding motif instances in DNA sequences. it's c++ and it beats my
pure python, brute-force code with both hands down... Of course this
come at a price of only being applicable to DNA (only unambiguous
alphabet etc.). Since they did the comparison, we have actually
incorporated the _pwm.c module written by Michiel, which is also much
faster and can be used for finding motifs in DNA.

I have compared their performance with our code on a single Drosophila
chromosme (20Mb) the results are similarly devastating to my old code:
their code takes ~1.1 sec (advanced look-ahead algorithms in C++)
while mine (pure python) takes 350 secs. The code contributed recently
by Michiel (simple algorithm, but in C) takes 2.3secs to finish.

since they provide python interface (there is nothing biopython
related, despite their abstract), I was even thinking about
incorporating their code into Biopython, but it's GPL, Instead, I can
make the function using Michiel's code aware of the MOODS package:
i.e. use it if it is installed.

If we want to put it into the news, It would be worth mentioning that
(thanks to Michiel) we have made quite some progress on that front.

As a side note, I feel a little bit guilty of making biopython look
slow compared to other tools. In the paper, they show a comparison
between different tools (MOODS, bioperl, biopython) in terms of speed,
which shows biopython as by far the slowest. This is just because I
was not writing this  code with speed in mind (I work on short
regulatory sequences...). Nonetheless, it can make an impression that
biopython is slow in general, which is not true. I will try to extend
Michiel's code to accept different alphabets and then maybe phase out
the slow code of mine.

Bartek