[emboss-dev] FASTQ parsing speed in EMBOSS

Tue Jul 28 12:51:08 UTC 2009

I've retitled this and CC'ed it to the EMBOSS dev list - which is
probably a better place for this now!

On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter wrote:
>> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote:
>
>>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and
>>>> Biopython's FASTQ parsing stacks up in terms of run time?
>>>
>>> We better be the fastest. Everyone knows that C code is bloated
>>> and slow.
>>
>> I'm pretty sure that was tongue in cheek, but if you were being mean
>> you probably could describe some of the EMBOSS infrastructure
>> as bloat. In any case, I'm sure that EMBOSS can be made faster
>> now that speed matters here with next generation sequencing, see:
>> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html
>
> EMBOSS code is indeed bloated and slow in some places - for example on
> output it constructs a sequence output object from the input sequence.
> However, it's C ... if we know what we're doing we can tell the machine
> to go faster. Unless the compiler decides it can optimise us away...
>
> Certainly this is a place where using reference-counted strings shows
> gains. We tend to avoid them in EMBOSS because early experience in
> optimising had them being deleted at the 'wrong' times and leaving us
> with no significant improvement in performance. Sequence output looks
> like a good place for them.
>
> We can also simplify the sequence output objects to avoid some of the
> reset operations when reusing the objects.
>
>> And I've got bad news for you then - currently EMBOSS seqret
>> is about twice as fast as CVS Biopython SeqIO (measuring parsing
>> versus writing is a bit tricky). However, I have a cunning plan:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html
>
> Worse news, I can find some speedups in EMBOSS ... though
> the split is about 40% in output and 60% in input CPU time.

Well, it is only bad news from the point of view of Biopython
bragging rights ;)

And with those speedups, I guess my fast lower-level Biopython
FASTQ to FASTA script will now be about the same speed as
seqret! See:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html
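
To give a rough idea of what "lower level" means here, this is a
minimal sketch of a FASTQ to FASTA conversion in plain Python. It is
not the script from the post linked above: the function name
fastq_to_fasta is hypothetical, and it assumes strict four-line
records (no line wrapping in the sequence or quality) and does no
checking of the quality string.

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Convert four-line FASTQ records to FASTA; return the record count.

    Assumes each record is exactly four lines: '@title', sequence,
    '+' (optionally repeating the title), and quality. The quality
    line is read but never parsed, which is where the time is saved.
    """
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        in_handle.readline()  # quality line, skipped without parsing
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Not a four-line FASTQ record: %r" % title)
        out_handle.write(">" + title[1:].rstrip("\r\n") + "\n")
        out_handle.write(seq.rstrip("\r\n") + "\n")
        count += 1
    return count


# Small in-memory demonstration with two made-up reads.
example = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+r2\n!!\n")
result = io.StringIO()
n = fastq_to_fasta(example, result)
print(n, "records converted")
print(result.getvalue(), end="")
```

The point of the sketch is only that skipping the creation of full
sequence objects, and never decoding the qualities, removes most of
the per-record overhead.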

Nice work!

> I/O time is another issue where we could play with blocked
> reads ...  though when I tried that some time ago it seemed
> the operating systems and file systems were doing a grand
> job and it was hard to get a consistent speed gain even for
> one specific system.

Maybe best avoided, given EMBOSS is truly cross-platform.
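
For anyone who wants to repeat that experiment on their own system,
here is a rough sketch of the comparison in Python rather than C. It
has nothing to do with the EMBOSS code itself: both function names
are hypothetical, and the 1 MiB default block size is an arbitrary
assumption. One function does explicit blocked reads with Python's
own buffering switched off, the other leaves buffering to the
runtime and the operating system.

```python
import os
import tempfile


def count_lines_blocked(path, block_size=1 << 20):
    """Count newline characters using explicit fixed-size block reads."""
    count = 0
    # buffering=0 gives a raw file object, so each read() maps to a
    # single request of up to block_size bytes.
    with open(path, "rb", buffering=0) as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break  # empty bytes object means end of file
            count += block.count(b"\n")
    return count


def count_lines_buffered(path):
    """Count lines using the default buffered, line-by-line iteration."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


# Tiny self-check on a temporary file that ends with a newline, so
# both functions agree (a final line with no trailing newline would
# be counted by the second function only).
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"@r1\nACGT\n+\nIIII\n" * 2)
print(count_lines_blocked(tmp_path), count_lines_buffered(tmp_path))
os.remove(tmp_path)
```

Timing the two functions on a large FASTQ file (for example with
the timeit module) would show whether blocked reads gain anything on
a given operating system and file system.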

Peter C.