[Biopython-dev] [Biopython] MOODS: fast search for position weight matrix matches in DNA sequences.

Mon Oct 12 18:08:26 UTC 2009

Hi all,

On Thu, Sep 24, 2009 at 2:51 PM, Bartek Wilczynski
<bartek at rezolwenta.eu.org> wrote:
> On Thu, Sep 24, 2009 at 2:27 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>
>> A separate news post mentioning the C option speed and showing usage
>> examples from both is a great idea. Responsiveness to new methods is
>> the fun part of science.
>>
> I'll try to write that up and send it to the list.

Unfortunately, this took me longer than I thought it would...

The reasons are partly unrelated (finishing a paper) and
partly related to the matter at hand.

In short, my original plan was to include a wrapper for MOODS
as a patch to Biopython (if it is installed on the system -> use it) and
to include that information in the blog post. However, as I ran more
tests of MOODS, I found that it might not be such a great idea. While
the C module written for Biopython by Michiel works like a charm,
the MOODS package is a bit more moody... I needed to tweak the
makefile to compile it on my Mac, but it worked (most of the
time) afterwards. Then I tried it on my Linux box, where it
compiled with no problems but gave me segfaults on
scripts which ran fine on the Mac (it did run the simple examples,
though...). In addition, I found that the performance of MOODS
was not always better than that of the brute-force algorithm which is
already in Biopython. At the same time, Michiel's code is far easier
to maintain than the complex code in MOODS.
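
To make "brute force" concrete, here is a minimal pure-Python sketch of
that kind of scan: score every window of the sequence against the matrix
and keep the windows above a threshold. It is only an illustration, not
the C code in Biopython; the function name scan_brute_force, the
list-of-dicts matrix layout and the toy matrix are all made up for this
example.

```python
def scan_brute_force(sequence, log_odds, threshold):
    """Return (position, score) for every window scoring >= threshold.

    log_odds is a list with one dict per motif position, mapping a
    nucleotide to its log-odds score (a hypothetical layout chosen
    for this sketch).
    """
    width = len(log_odds)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = 0.0
        for offset in range(width):
            column = log_odds[offset]
            base = sequence[start + offset]
            if base not in column:
                # Ambiguous symbol such as 'N': skip this window.
                break
            score += column[base]
        else:
            # Inner loop finished without a break: the window is valid.
            if score >= threshold:
                hits.append((start, score))
    return hits


# Toy 3-column matrix that favours the word "ACG".
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan_brute_force("TTACGACGNACG", toy_matrix, threshold=3.0)
print(hits)
```

The work is proportional to (sequence length) x (motif width) whatever
the motif looks like, which is why such a scan is simple to maintain and
its running time is easy to predict.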

In conclusion, I don't think it is worth putting too much effort into
integration now. I will try to contact the MOODS team about
the issues I encountered and see whether they are interested in
getting it integrated into Biopython. If so, I can try to help with
this, but it may be that we will just provide a function for getting
a properly formatted log-odds matrix from a Biopython motif for use in
MOODS. After all, I think that not many applications require the
performance gains of MOODS over our current implementation.
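
As a rough idea of what such a helper computes, the sketch below turns
per-position nucleotide counts into a log-odds matrix with one row per
nucleotide. This is not the function in my branch: the name
counts_to_log_odds, the pseudocount, the uniform background and the
row order A, C, G, T are assumptions made for the example, and the
exact layout MOODS expects should be checked against its documentation.

```python
import math


def counts_to_log_odds(counts, background=None, pseudocount=1.0):
    """Convert counts into log2-odds rows ordered A, C, G, T.

    counts maps each nucleotide to a list of counts, one per motif
    position. background maps each nucleotide to its background
    probability (uniform if not given).
    """
    alphabet = "ACGT"
    if background is None:
        background = {base: 0.25 for base in alphabet}
    width = len(counts["A"])
    rows = []
    for base in alphabet:
        row = []
        for pos in range(width):
            # One pseudocount per nucleotide avoids log(0).
            total = sum(counts[b][pos] for b in alphabet)
            total += pseudocount * len(alphabet)
            freq = (counts[base][pos] + pseudocount) / total
            row.append(math.log(freq / background[base], 2))
        rows.append(row)
    return rows


# Toy motif of width 2 built from four aligned sites, all "AC".
toy_counts = {"A": [4, 0], "C": [0, 4], "G": [0, 0], "T": [0, 0]}
matrix = counts_to_log_odds(toy_counts)
for base, row in zip("ACGT", matrix):
    print(base, " ".join("%.3f" % value for value in row))
```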

For the purpose of the blog post I've written a short script:
http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py

which can be run provided you have Biopython 1.51 or later and the
MOODS Python bindings installed.
It uses two different motifs to show how differently Bio.Motif._pwm
and MOODS can behave. The output from my machine is as follows:

reading the sequence took  0.603 seconds
First motif: SRF
MOODS calculation took  0.768 seconds on average
Bio.Motif fast calculation took  2.407 seconds on average
Second motif: Broad complex II
MOODS calculation took  5.72 seconds on average
Bio.Motif fast calculation took  2.687 seconds on average

The averages are calculated from 10 runs, and they do not change
substantially across different executions.
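
For reference, averaging over repeated runs can be done along these
lines. This is a generic sketch rather than the code from the benchmark
script; average_runtime and dummy_scan are made-up names, and dummy_scan
merely stands in for a real motif search.

```python
import time


def average_runtime(func, runs=10):
    """Call func() `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def dummy_scan():
    # Placeholder workload standing in for a motif search.
    return sum(i * i for i in range(100000))


mean_seconds = average_runtime(dummy_scan, runs=10)
print("dummy scan took %.3f seconds on average" % mean_seconds)
```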

I've made a biopython branch including this script and the additional
function in Bio.Motif (for extracting log-odds in MOODS compatible
format).

I've also drafted a blog post, but I would greatly appreciate any help
from people who are more skilled at writing. Here is how it goes:

"""
In a recent article, Janne Korhonen et al. (Bioinformatics, 2009)
introduce a new, fast software library for finding motif occurrences in
DNA sequences. They also compare the performance of their tool with
that of the currently available solutions from Bioperl and Biopython.

Unfortunately, Biopython is the only tool in the comparison whose
performance was measured using a solution written in an interpreted
language, while both MOODS and Bioperl are written in compiled
languages (C++ and C, respectively). Not surprisingly, this shows
Biopython as by far the slowest of the three. Since the authors made
their comparisons, however, we have moved on: thanks to the C code
contributed by Michiel de Hoon and included in the 1.51 release,
Biopython's motif finding library has improved greatly and now performs
comparably to the MOODS package.

The results of a quick benchmark script
(http://github.com/barwil/biopython/blob/df0dfa8feeb15ce50d027d1492913f2d8920c9b3/Tests/Motif/moods_motif_benchmark.py)
indicate that a simple algorithm implemented in C is able to scan a
whole chromosome (>23Mb) in less than 3 seconds for a typical DNA motif.
Depending on the motif, the advanced linear algorithm from the MOODS
package can decrease (or in some cases even increase) this running time
by a few seconds.

"""

It sounds quite dull to me, so I would greatly appreciate ideas on
improving the text and making it less formal and boring...

Cheers
Bartek