[Bioperl-l] FASTQ, was Re: BioPerl long-term, was Re: dependencies on perl version

Wed Feb 6 22:53:21 UTC 2013

On Feb 6, 2013, at 4:43 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Wed, Feb 6, 2013 at 10:11 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
>> 
>> I see no problem in stating any generic parsing and low-level interfaces
>> are just as much a part of what BioPerl encompasses as the higher-level
>> Bio::* classes themselves.  Steve and Jason were on to something with
>> SearchIO; it's maybe not as performant as we would like, but it certainly
>> is more flexible in terms of what can be done, b/c it separates out
>> low-level parsing from object creation.  That's the general model we
>> should look at.  There is a good reason Biopython is following this
>> model with their SearchIO implementation (Peter C, are you reading this?)
> 
> Actually I don't think we did end up with that kind of separation in the
> Biopython SearchIO - which is not to say it isn't an excellent model
> to follow. Rather, the Biopython SearchIO (like the BioPerl one) had
> as its first goal a consistent object model across assorted file
> formats.
> 
> The idea of low-level, minimal-overhead parsers (which are very
> format-specific), on which a heavier but consistent object model
> can be built, might be a good balance - the high-level API has the
> convenience, but if you give that up you can have more speed.
> That's what I recommend with FASTQ and Biopython, e.g.
> http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
> 
>> 
>> I have started a wrapper around Heng's FASTQ/FASTA parsing
>> code (kseq), it seems to work quite well (~20M FASTQ in 30 sec
>> last I recall?).
>> 
> 
> I'd have to dig through my emails, but I think the BioRuby guys
> looked at that too - as I recall while it was fast, the error handling
> left something to be desired. Email me directly or on the BioRuby
> list if you want to follow up on that.
> 
> Regards,
> 
> Peter

I did a little work on this and it is worth following up on; I pulled the FASTQ test examples you created from the paper to test it out.  IIRC it parsed where it needed to, but I'm not sure how it handled bad sequences, so yes, worth looking into.  Maybe worth moving to open-bio-l for broader discussion.

chris
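
To make the layering discussed above concrete, here is a minimal Python sketch of a low-level FASTQ parser with a heavier object model built on top of it. It is only an illustration under stated assumptions: the names iter_fastq_raw, iter_fastq_records and FastqRecord are hypothetical, it is not the kseq wrapper or any BioPerl/Biopython API, and it assumes strict four-line records with Sanger-style (offset 33) qualities. The low-level layer is similar in spirit to the tuple-yielding approach in the Biopython post linked above, and it raises on malformed records rather than skipping them silently.

```python
from dataclasses import dataclass
from io import StringIO
from typing import Iterator, List, TextIO, Tuple


def iter_fastq_raw(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Low-level layer: yield (title, sequence, quality) as plain strings.

    Assumption: each record is exactly four lines (no line wrapping).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of input
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        # Minimal error handling: fail loudly on malformed or truncated records.
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at record start, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\r\n"), seq, qual


@dataclass
class FastqRecord:
    """High-level layer: a (hypothetical) format-independent record object."""
    id: str
    description: str
    seq: str
    phred: List[int]


def iter_fastq_records(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Build record objects on top of the raw parser (assumes Sanger offset 33)."""
    for title, seq, qual in iter_fastq_raw(handle):
        parts = title.split(None, 1)
        read_id = parts[0] if parts else ""
        yield FastqRecord(read_id, title, seq, [ord(c) - offset for c in qual])


if __name__ == "__main__":
    data = "@read1 sample\nACGT\n+\nIIII\n"
    for title, seq, qual in iter_fastq_raw(StringIO(data)):   # fast path
        print(title, len(seq))
    for record in iter_fastq_records(StringIO(data)):         # convenient path
        print(record.id, record.phred)
```

Callers that only need lengths or raw strings can stay on the tuple layer and skip object creation entirely, while the record layer pays the extra cost for a consistent interface. Both share the same validation, so bad input is reported the same way on either path.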