[Biopython-dev] Next release plans?

Peter Rice pmr at ebi.ac.uk
Tue Jul 28 08:40:43 EDT 2009


Peter wrote:
> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and
>>> Biopython's FASTQ parsing stacks up in terms of run time?
>>
>> We better be the fastest. Everyone knows that C code is bloated
>> and slow.
> 
> I'm pretty sure that was tongue in cheek, but if you were being mean
> you probably could describe some of the EMBOSS infrastructure
> as bloat. In any case, I'm sure that EMBOSS can be made faster
> now that speed matters here with next generation sequencing, see:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html

EMBOSS code is indeed bloated and slow in some places - for example on
output it constructs a sequence output object from the input sequence.
However, it's C ... if we know what we're doing we can tell the machine
to go faster. Unless the compiler decides it can optimise us away...

Certainly this is a place where using reference-counted strings shows
gains. We tend to avoid them in EMBOSS because early experience in
optimising had them being deleted at the 'wrong' times and leaving us
with no significant improvement in performance. Sequence output looks
like a good place for them.
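To illustrate the idea only (this is not EMBOSS code, and the class and
method names below are hypothetical), here is a minimal Python sketch of a
reference-counted string handle. The output object takes a shared handle to
the input sequence text instead of copying it, and the payload is freed only
when the last handle is released. Releasing at the 'wrong' time is exactly
the failure mode described above.

```python
class RefString:
    """Hypothetical reference-counted string handle (illustration only)."""

    def __init__(self, text):
        # Shared mutable box: [payload, reference count]
        self._box = [text, 1]

    def share(self):
        """Return a new handle to the same payload without copying it."""
        other = RefString.__new__(RefString)
        other._box = self._box
        self._box[1] += 1
        return other

    def release(self):
        """Drop this handle; discard the payload when no handles remain."""
        if self._box is None:
            return  # already released
        self._box[1] -= 1
        if self._box[1] == 0:
            self._box[0] = None
        self._box = None

    @property
    def text(self):
        return self._box[0]

    @property
    def refs(self):
        return self._box[1]


if __name__ == "__main__":
    seq_in = RefString("ACGTACGT")   # input sequence text
    seq_out = seq_in.share()         # output object shares, does not copy
    seq_in.release()                 # input can go away first
    print(seq_out.text, seq_out.refs)
```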

We can also simplify the sequence output objects to avoid some of the
reset operations when reusing the objects.

> And I've got bad news for you then - currently EMBOSS seqret
> is about twice as fast as CVS Biopython SeqIO (measuring parsing
> versus writing is a bit tricky). However, I have a cunning plan:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html

Worse news, I can find some speedups in EMBOSS ... though the split is
about 40% in output and 60% in input CPU time.
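For anyone wanting to measure a similar input/output split on the Python
side, a rough sketch follows. It assumes strict four-line FASTQ records and
uses a deliberately naive stand-in parser and writer (parse_fastq,
write_fastq and time_split are hypothetical names, not Biopython or EMBOSS
functions), so the numbers it gives say nothing about either package.

```python
import io
import time


def parse_fastq(handle):
    """Yield (title, sequence, quality) from strict four-line FASTQ records."""
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:].rstrip("\n"), seq, qual


def write_fastq(records, handle):
    """Write (title, sequence, quality) tuples as four-line FASTQ records."""
    for title, seq, qual in records:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


def time_split(text):
    """Return (parse seconds, write seconds, written text) for FASTQ text."""
    start = time.perf_counter()
    records = list(parse_fastq(io.StringIO(text)))
    parsed = time.perf_counter()
    out = io.StringIO()
    write_fastq(records, out)
    written = time.perf_counter()
    return parsed - start, written - parsed, out.getvalue()


if __name__ == "__main__":
    data = "@read\nACGTACGT\n+\nIIIIIIII\n" * 100000
    t_in, t_out, _ = time_split(data)
    total = t_in + t_out
    print("input %.0f%%, output %.0f%%" % (100 * t_in / total, 100 * t_out / total))
```

Working from in-memory strings keeps disk I/O out of the measurement, so
this only separates parsing cost from formatting cost.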

I/O time is another issue where we could play with blocked reads ...
though when I tried that some time ago it seemed the operating systems
and file systems were doing a grand job and it was hard to get a
consistent speed gain even for one specific system.
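A quick way to repeat that kind of experiment is sketched below in Python
(not the C used in EMBOSS). It reads a file in fixed-size blocks with
Python's own buffering switched off and times a few block sizes. The
function names and the block sizes are arbitrary choices for illustration,
and the timings depend heavily on the operating system's file cache.

```python
import os
import tempfile
import time


def count_lines_blocked(path, block_size):
    """Count newline bytes by reading the file in fixed-size blocks."""
    count = 0
    # buffering=0 gives a raw file object, so Python adds no buffer of its own
    with open(path, "rb", buffering=0) as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break  # end of file
            count += block.count(b"\n")
    return count


def time_block_sizes(path, block_sizes):
    """Return {block_size: (line_count, seconds)} for each block size."""
    results = {}
    for size in block_sizes:
        start = time.perf_counter()
        lines = count_lines_blocked(path, size)
        results[size] = (lines, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    # Write a throwaway FASTQ-like file to time against
    fd, path = tempfile.mkstemp(suffix=".fastq")
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"@read\nACGTACGT\n+\nIIIIIIII\n" * 200000)
    try:
        for size, (lines, secs) in time_block_sizes(path, [4096, 65536, 1048576]).items():
            print("%8d bytes/block: %d lines in %.4f s" % (size, lines, secs))
    finally:
        os.remove(path)
```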

regards,

Peter Rice

