[Bioperl-l] Next-gen modules

Wed Jun 17 19:30:15 UTC 2009

On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:

> Tristan Lefebure wrote:
>> Hello,
>> Regarding next-gen sequences and bioperl, in my experience  
>> another issue is bioperl speed. For example, if you want to trim  
>> bad quality bases at the ends of 1E6 Solexa reads using  
>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,  
>> you've got to be patient (but maybe I missed some shortcuts...).
>
> This is my concern as well. Or, rather, is there actually a  
> significant set of users out there who are dealing with next-gen  
> sequencing and would consider using BioPerl for their work?
>
> I'm working with all the 1000-genomes data at the Sanger, and we at  
> least are probably never going to use BioPerl for the work.

Are you using pure perl or (gasp) something else?  ;>

Judging by the feedback, there is definitely a set of users who would  
like to integrate next-gen into bioperl somehow, probably to take  
advantage of other aspects of bioperl.

>> A pure perl solution will be between 100 and 1000x faster... Would  
>> it be possible to have an ultra-light quality object with few  
>> simple methods for next-gen reads?
>
> The fastq parser itself already seems pretty fast. The way to get  
> the speedup is to not create any Bio::Seq* objects but just return  
> the data directly. At that point it's not taking much advantage of  
> BioPerl. But certainly it could be done...
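
To make the "just return the data directly" idea concrete, here is a
rough sketch of the approach, written in Python only to keep it
language-neutral. It is not how Bio::SeqIO::fastq works. It assumes
strict four-line FASTQ records and a Phred+33 quality encoding (Solexa
files of this era may need a different offset and score scale), and
the names read_fastq, trim_ends, min_score and offset are made up for
illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) tuples; no objects are built.

    Assumes strict four-line records: header, sequence, '+', quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def trim_ends(seq, qual, min_score=20, offset=33):
    """Strip bases scoring below min_score from both ends of a read.

    min_score and offset are illustrative defaults, not recommendations.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_score:
        start += 1
    while end > start and scores[end - 1] < min_score:
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    data = io.StringIO("@r1\nACGTACGT\n+\n##IIII#!\n")
    for name, seq, qual in read_fastq(data):
        print(name, *trim_ends(seq, qual))
```

The trade-off is the one noted above: plain tuples are cheap to
create, but nothing downstream can treat them as Bio::Seq* objects
without an extra conversion step.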

I suppose the best way to assess what needs to be done is to come up  
with a set of 'use cases' specifying what users want so we can design  
around them; otherwise we're shooting in the dark.

I'm personally wondering if this could be done as a sequence database,  
something similar in theme to Lincoln's SeqFeature::Store, but  
sequence-only, returning quality objects in a similar manner (à la  
Storable)?  I'm not sure whether that's feasible, but it appears to be  
scalable at least.
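
As a very rough illustration of what I have in mind (a sketch only,
not a proposed design): one table keyed by read name, holding the
sequence as text and the quality scores as a serialized blob. Python's
sqlite3 and pickle modules stand in here for a DBI-backed table and
Storable, and the ReadStore class, its methods and the table layout
are all hypothetical.

```python
import pickle
import sqlite3


class ReadStore:
    """Sequence-only store: one row per read, qualities kept as a blob."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS reads ("
            "name TEXT PRIMARY KEY, seq TEXT NOT NULL, qual BLOB NOT NULL)"
        )

    def add(self, name, seq, scores):
        """Insert or overwrite a read; scores is a sequence of integers."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO reads VALUES (?, ?, ?)",
                (name, seq, pickle.dumps(list(scores))),
            )

    def fetch(self, name):
        """Return (seq, scores) for a read, or None if it is not stored."""
        row = self.db.execute(
            "SELECT seq, qual FROM reads WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            return None
        seq, blob = row
        return seq, pickle.loads(blob)


if __name__ == "__main__":
    store = ReadStore()
    store.add("r1", "ACGT", [2, 40, 40, 2])
    print(store.fetch("r1"))
```

Whether a layout like this holds up at the scale of millions of reads
is exactly the kind of question the use cases would need to answer.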

chris