[emboss-dev] emboss-dev Digest, Vol 11, Issue 14
jitesh dundas
jbdundas at gmail.com
Tue Jul 28 21:06:43 EDT 2009
Dear Sir,
I am going to begin writing code for making parallel program execution
possible in EMBOSS.
As I am still learning EMBOSS, I need someone to answer my questions about it.
On 7/28/09, emboss-dev-request at lists.open-bio.org
<emboss-dev-request at lists.open-bio.org> wrote:
>
> Today's Topics:
>
> 1. FASTQ parsing speed in EMBOSS (Peter)
> 2. Re: FASTQ parsing speed in EMBOSS (Peter Rice)
> 3. Re: FASTQ parsing speed in EMBOSS (Peter)
> 4. FASTQ parsing speed in EMBOSS (Peter)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 27 Jul 2009 18:39:49 +0100
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
> To: emboss-dev at lists.open-bio.org
> Message-ID:
> <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi all,
>
> I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
> some of the FASTQ issues I've raised, and I decided to do a few
> simple benchmarks.
>
> For this example, I have used a 1.3 GB standard Sanger FASTQ
> file from the NCBI short read archive which contains just over
> seven million short reads of length 36 bp, which I believe were
> originally from a Solexa/Illumina machine. This is actually one
> of a pair of FASTQ files as this was a paired end run. The file is
> here (compressed):
>
> ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/SRR001666_1.fastq.gz
>
> Note that some of the quality lines start with "@", so you can't
> use grep for "^@" to count the records. However, all the reads
> have an identifier starting SRR so you can do this:
>
> $ time grep "^@SRR" SRR001666_1.fastq | wc -l
> 7047668
>
> real 0m15.886s
> user 0m18.357s
> sys 0m1.268s
>
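The "@" pitfall above can also be avoided by counting records by position rather than by pattern. A minimal sketch in Python, assuming strict four-line FASTQ records with no line wrapping and no trailing blank lines; the function name and the two-record sample are hypothetical and not taken from EMBOSS or Biopython:

```python
import io


def count_fastq_records(handle):
    """Count records, assuming strict four-line FASTQ (an assumption)."""
    count = 0
    for line_number, line in enumerate(handle):
        # Only every fourth line, starting with the first, is a title line.
        if line_number % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(
                    f"line {line_number + 1}: expected a title line starting with '@'"
                )
            count += 1
    return count


# Hypothetical sample: the first record's quality line starts with "@".
SAMPLE = (
    "@SRR000001.1\n"
    "GATTACA\n"
    "+\n"
    "@IIIIII\n"
    "@SRR000001.2\n"
    "ACGTACG\n"
    "+\n"
    "IIIIIII\n"
)

print(count_fastq_records(io.StringIO(SAMPLE)))
```

On this sample a plain count of lines starting with "@" gives 3, while the positional count gives 2, which is the number of records.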
> For this example, I want to convert the FASTQ file to FASTA
> (i.e. ignore and throw away the quality scores). This is a fairly
> common task, as almost all assemblers will take FASTA files,
> even if they don't understand FASTQ.
>
> As I didn't want to waste disk space and I wanted a basic
> check on the output, I have simply piped the output via
> grep and wc to count the FASTA records:
>
> $ time seqret -filter -sformat fastq-sanger -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l
> 7047668
>
> real 2m48.288s
> user 3m3.994s
> sys 0m3.525s
>
> I've run this several times, and this result is typical. So, using
> the "fastq-sanger" format this takes about 2m48s. There is a
> slight speed up using "fastq" as the EMBOSS input format
> name, as this never has to convert the quality strings into
> PHRED values:
>
> $ time seqret -filter -sformat fastq -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l
> 7047668
>
> real 2m43.566s
> user 2m59.077s
> sys 0m3.540s
>
> i.e. about 2m44s, saving nearly 5s.
>
> Just for the record, actually doing the FASTQ to FASTA conversion
> to a file (without grep and wc) takes about 2m52s:
>
> $ time seqret -filter -sformat fastq -osformat fasta -sequence SRR001666_1.fastq -outseq SRR001666_1.fasta
>
> real 2m51.791s
> user 2m40.545s
> sys 0m4.848s
>
> This is over 40 thousand reads per second, but I was still a
> little disappointed in the run time. Improvements in the FASTQ
> parsing/writing speed would help get EMBOSS used in
> sequencing centre pipelines. Once we have the EMBOSS
> FASTQ input/output working as intended, does trying to
> speed it up further seem worthwhile?
>
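The "over 40 thousand reads per second" figure follows directly from the record count and the wall-clock ("real") times quoted above. A small Python sketch of that arithmetic; the helper name is hypothetical, and the numbers are copied from the timings in this message:

```python
READS = 7047668  # record count reported by grep/wc above


def reads_per_second(reads, minutes, seconds):
    """Throughput from a 'real' time given as minutes and seconds."""
    return reads / (minutes * 60 + seconds)


# Wall-clock times copied from the three seqret runs above.
RUNS = [
    ("fastq-sanger, piped to grep", 2, 48.288),
    ("fastq, piped to grep", 2, 43.566),
    ("fastq, written to file", 2, 51.791),
]

for label, minutes, seconds in RUNS:
    print(f"{label}: {reads_per_second(READS, minutes, seconds):,.0f} reads/s")
```

All three runs come out between roughly 41,000 and 43,100 reads per second.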
> One specific suggestion is for the "fastq" parser (function
> seqReadFastq), which doesn't do anything with the quality
> strings. Other than for a debug statement, there is no need
> to execute these lines:
>
> minqual = ajStrGetAsciiLow(qualstr);
> maxqual = ajStrGetAsciiHigh(qualstr);
> comqual = ajStrGetAsciiCommon(qualstr);
>
> In fact, you don't really need to record qualstr at all. Could
> you just verify the total length of the quality string, without
> actually recording it in a buffer?
>
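The idea of checking only the length of the quality string can be sketched as follows. This is an illustration in Python and not the EMBOSS seqReadFastq code: it assumes strict four-line FASTQ records, and the function name is hypothetical. The quality line still has to be read to advance through the file, but only its length is kept:

```python
import io


def fastq_to_fasta(in_handle, out_handle):
    """Convert four-line FASTQ to FASTA, checking quality length only."""
    count = 0
    while True:
        title = in_handle.readline()
        if not title:
            break  # end of input
        seq = in_handle.readline().rstrip("\r\n")
        plus = in_handle.readline()
        # The quality string itself is discarded; only its length is kept.
        qual_len = len(in_handle.readline().rstrip("\r\n"))
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {title!r}")
        if qual_len != len(seq):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        out_handle.write(">" + title[1:].rstrip("\r\n") + "\n" + seq + "\n")
        count += 1
    return count


# Hypothetical one-record input whose quality line starts with "@".
fasta = io.StringIO()
converted = fastq_to_fasta(io.StringIO("@read1 example\nACGT\n+\n@III\n"), fasta)
print(converted, repr(fasta.getvalue()))
```

A record whose quality line is shorter or longer than its sequence raises ValueError instead of being written out.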
> Another suggestion (although not demonstrated in the above
> benchmark) is for the Solexa FASTQ parsing (and output).
> From looking at the code, you map the ASCII to a PHRED
> score for each letter of every read. This is a relatively
> expensive operation using powers and logs. I would try
> using a precomputed look up table (something I have just
> been working on for Biopython - this made a very big
> difference, especially when converting to/from Solexa
> scores to PHRED scores).
>
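A precomputed lookup table of the kind suggested above might look like this in Python. It is a sketch and not the EMBOSS or Biopython implementation: it assumes the usual Solexa FASTQ encoding (ASCII offset 64, Solexa scores from -5 to 62), uses the standard conversion Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), and rounds to the nearest integer. All names are hypothetical:

```python
import math

SOLEXA_OFFSET = 64  # assumed ASCII offset of Solexa-style quality strings


def solexa_to_phred(solexa_q):
    """Convert one Solexa score to a PHRED score (floating point)."""
    return 10 * math.log10(10 ** (solexa_q / 10.0) + 1)


# Built once: maps each valid ASCII code straight to a rounded PHRED score,
# so the power and logarithm are never evaluated per base while parsing.
PHRED_FROM_SOLEXA_ASCII = {
    code: int(round(solexa_to_phred(code - SOLEXA_OFFSET)))
    for code in range(SOLEXA_OFFSET - 5, SOLEXA_OFFSET + 63)
}


def phred_scores(quality_string):
    """Look up the PHRED score of every character in a quality string."""
    table = PHRED_FROM_SOLEXA_ASCII
    return [table[ord(ch)] for ch in quality_string]


print(phred_scores(";@h"))  # Solexa scores -5, 0 and 40
```

The table has only 68 entries, so building it costs next to nothing compared with evaluating the formula for each base of several million reads.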
> Peter C.
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 28 Jul 2009 09:05:47 +0100
> From: Peter Rice <pmr at ebi.ac.uk>
> Subject: Re: [emboss-dev] FASTQ parsing speed in EMBOSS
> To: Peter <biopython at maubp.freeserve.co.uk>
> Cc: emboss-dev at lists.open-bio.org
> Message-ID: <4A6EB15B.20903 at ebi.ac.uk>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Peter wrote:
>> Hi all,
>>
>> I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
>> some of the FASTQ issues I've raised, and I decided to do a few
>> simple benchmarks.
>>
>> This is over 40 thousand reads per second, but I was still a
>> little disappointed in the run time. Improvements in the FASTQ
>> parsing/writing speed would help get EMBOSS used in
>> sequencing centre pipelines. Once we have the EMBOSS
>> FASTQ input/output working as intended, does trying to
>> speed it up further seem worthwhile?
>
> Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing
> the output takes about as long as reading the input. There may be ways to
> speed that up (output requires making an output sequence object which takes
> half the output time).
>
> Building EMBOSS with --with-gccprofile and compiling with gcc creates a
> gprof profile. Very useful for catching bottlenecks.
>
> Up to the advent of NGS data, large input/output runs have been limited to
> converting EMBL/GenBank into Fasta as a one-off every few months so looking
> into the efficiency of sequence reading/writing has been a low priority.
> Now it does assume much more importance.
>
>> Another suggestion (although not demonstrated in the above
>> benchmark) is for the Solexa FASTQ parsing (and output).
>> From looking at the code, you map the ASCII to a PHRED
>> score for each letter of every read. This is a relatively
>> expensive operation using powers and logs. I would try
>> using a precomputed look up table (something I have just
>> been working on for Biopython - this made a very big
>> difference, especially when converting to/from Solexa
>> scores to PHRED scores).
>
> Yes, that was on my list of future changes. There wasn't time to fully
> implement and test before the release freeze.
>
> regards,
>
> Peter
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 28 Jul 2009 10:21:33 +0100
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [emboss-dev] FASTQ parsing speed in EMBOSS
> To: Peter Rice <pmr at ebi.ac.uk>
> Cc: emboss-dev at lists.open-bio.org
> Message-ID:
> <320fb6e00907280221y141797fcw81faeefd22429fb1 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Tue, Jul 28, 2009 at 9:05 AM, Peter Rice<pmr at ebi.ac.uk> wrote:
>>
>> Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing
>> the output takes about as long as reading the input. There may be ways to
>> speed that up (output requires making an output sequence object which takes
>> half the output time).
>>
>> Building EMBOSS with --with-gccprofile and compiling with gcc creates a
>> gprof profile. Very useful for catching bottlenecks.
>
> Nice tip.
>
>> Up to the advent of NGS data, large input/output runs have been limited to
>> converting EMBL/GenBank into Fasta as a one-off every few months so looking
>> into the efficiency of sequence reading/writing has been a low priority.
>> Now it does assume much more importance.
>
> Exactly :)
>
>>> Another suggestion (although not demonstrated in the above
>>> benchmark) is for the Solexa FASTQ parsing (and output).
>>> From looking at the code, you map the ASCII to a PHRED
>>> score for each letter of every read. This is a relatively
>>> expensive operation using powers and logs. I would try
>>> using a precomputed look up table (something I have just
>>> been working on for Biopython - this made a very big
>>> difference, especially when converting to/from Solexa
>>> scores to PHRED scores).
>>
>> Yes, that was on my list of future changes. There wasn't time to fully
>> implement and test before the release freeze.
>
> That makes sense - and it is a pretty obvious thing to try, so
> I would have been surprised if you hadn't come up with the
> same idea.
>
> Peter
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 28 Jul 2009 13:51:08 +0100
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
> To: Peter Rice <pmr at ebi.ac.uk>, emboss-dev at lists.open-bio.org
> Cc: biopython-dev at lists.open-bio.org
> Message-ID:
> <320fb6e00907280551n7a42563byb802016b2342de06 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> I've retitled this and CC'ed it to the EMBOSS dev list - which is
> probably a better place for this now!
>
> On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>> Peter wrote:
>>> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote:
>>
>>>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and
>>>>> Biopython's FASTQ parsing stacks up in terms of run time?
>>>>
>>>> We better be the fastest. Everyone knows that C code is bloated
>>>> and slow.
>>>
>>> I'm pretty sure that was tongue in cheek, but if you were being mean
>>> you probably could describe some of the EMBOSS infrastructure
>>> as bloat. In any case, I'm sure that EMBOSS can be made faster
>>> now that speed matters here with next generation sequencing, see:
>>> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html
>>
>> EMBOSS code is indeed bloated and slow in some places - for example on
>> output it constructs a sequence output object from the input sequence.
>> However, it's C ... if we know what we're doing we can tell the machine
>> to go faster. Unless the compiler decides it can optimise us away...
>>
>> Certainly this is a place where using reference-counted strings shows
>> gains. We tend to avoid them in EMBOSS because early experience in
>> optimising had them being deleted at the 'wrong' times and leaving us
>> with no significant improvement in performance. Sequence output looks
>> like a good place for them.
>>
>> We can also simplify the sequence output objects to avoid some of the
>> reset operations when reusing the objects.
>>
>>> And I've got bad news for you then - currently EMBOSS seqret
>>> is about twice as fast as CVS Biopython SeqIO (measuring parsing
>>> versus writing is a bit tricky). However, I have a cunning plan:
>>> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html
>>
>> Worse news, I can find some speedups in EMBOSS ... though
>> the split is about 40% in output and 60% in input CPU time.
>
> Well, it is only bad news from the point of view of Biopython
> bragging rights ;)
>
> And with those speed ups, I guess my fast lower level Biopython
> FASTQ to FASTA script will now be about the same speed as
> seqret! See:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html
>
> Nice work!
>
>> I/O time is another issue where we could play with blocked
>> reads ... though when I tried that some time ago it seemed
>> the operating systems and file systems were doing a grand
>> job and it was hard to get a consistent speed gain even for
>> one specific system.
>
> Maybe best avoided, given EMBOSS is truly cross platform.
>
> Peter C.
>
>
> ------------------------------
>
>
--
Thanks & Regards,
Jitesh Dundas
Research Associate, DIL Lab,
IIT-Bombay(www.dil.iitb.ac.in),
Scientist, Edencore Technologies(www.edencore.net)
Phone:- +91-9860925706
http://jiteshbdundas.blogspot.com
"No idea is stupid; either it's too good to be true, or it's way ahead of
its future" - GEORGE BERNARD SHAW.