[Bioperl-l] Next-gen modules

Wed Jun 17 18:38:26 EDT 2009

On Jun 17, 2009, at 5:10 PM, Sendu Bala wrote:

> Chris Fields wrote:
>> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>>> Tristan Lefebure wrote:
>>>> Hello,
>>>> Regarding next-gen sequences and bioperl, following my  
>>>> experience, another issue is bioperl speed. For example, if you  
>>>> want to trim bad quality bases at ends of 1E6 Solexa reads using  
>>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,  
>>>> you've got to be patient (but maybe I missed some shortcuts...).
>>>
>>> This is my concern as well. Or, rather, is there actually a  
>>> significant set of users out there who are dealing with next-gen  
>>> sequencing and would consider using BioPerl for their work?
>>>
>>> I'm working with all the 1000-genomes data at the Sanger, and we  
>>> at least are probably never going to use BioPerl for the work.
>> Are you using pure perl or (gasp) something else?  ;>
>
> We use some perl stuff, some C stuff. My own stuff is OO perl, but  
> much lighter weight than BioPerl. Absolute minimal object creation.

Makes sense.

>>>> A pure perl solution will be between 100 and 1000x faster... Would  
>>>> it be possible to have an ultra-light quality object with few  
>>>> simple methods for next-gen reads?
>>>
>>> The fastq parser itself already seems pretty fast. The way to get  
>>> the speedup is to not create any Bio::Seq* objects but just return  
>>> the data directly. At that point it's not taking much advantage of  
>>> BioPerl. But certainly it could be done...
>> I suppose the best way to assess what needs to be done is to come up  
>> with a set of 'use cases' specifying what users want so we can  
>> design around them, otherwise we're shooting in the dark.
>
> Indeed. Though at least I think we can all agree it would be nice to  
> have the functionality there even if it's slow. There will always be  
> at least some use-cases where the run speed doesn't matter.

Agreed.
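
For illustration only, here is a rough sketch of the "no objects, just
return the data" approach discussed above, written in Python rather
than Perl. It is not BioPerl code. The names read_fastq and trim_ends
are hypothetical, the parser assumes unwrapped four-line fastq records,
and the quality offset of 64 and cutoff of 20 are assumptions that
would need adjusting for other encodings.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples of plain strings.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual  # drop the leading '@'

def trim_ends(seq, qual, min_q=20, offset=64):
    """Trim bases scoring below min_q from both ends of a read.

    offset=64 is an assumed ASCII offset; other encodings use 33.
    """
    threshold = chr(min_q + offset)  # compare characters, not integers
    start, end = 0, len(qual)
    while start < end and qual[start] < threshold:
        start += 1
    while end > start and qual[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]

# Small in-memory example; a real run would pass an open file handle.
example = io.StringIO("@read1\nACGTACGTAC\n+\nBBhhhhhhBB\n")
for name, seq, qual in read_fastq(example):
    print(name, *trim_ends(seq, qual))
```

Everything stays a plain string or tuple, so the per-read cost is a
few readline calls and two short loops. The trade-off is the one noted
above: code like this takes no advantage of the rest of the toolkit.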

>> I'm personally wondering if this could be done as a sequence  
>> database, something similar in theme to Lincoln's  
>> SeqFeature::Store, but sequence only, and returns quality objects  
>> in a similar manner (ala Storable)?  Not sure whether that's  
>> feasible, but it at least appears scalable.
>
> I think not. Well, at least SeqFeature::Store doesn't scale. Try  
> storing millions of features in a database and watch it crawl to  
> complete unusability. I can't imagine a db scaling to holding  
> hundreds of TB of data either. I'm also not sure what the benefit  
> is. There are already high-speed ways of indexing your fastq or bam  
> files.

Interesting that you ran into issues with SF::Store; wonder if object  
storage is the limiting factor there, or if it is something else.  
Anyone else having this issue?
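
As a point of comparison with the database idea, one very simple form
of fastq indexing is a table of byte offsets keyed by read name. The
sketch below is in Python and is only a guess at the general
technique, not the specific tools mentioned above. build_index and
fetch are hypothetical names, and the code assumes unwrapped four-line
records.

```python
import io

def build_index(handle):
    """Map each read name to the byte offset where its record starts.

    Expects a binary-mode handle and four-line fastq records.
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of input
        # The name is the first whitespace-delimited token after '@'.
        name = header[1:].split()[0].decode("ascii")
        index[name] = offset
        for _ in range(3):  # skip sequence, '+', and quality lines
            handle.readline()
    return index

def fetch(handle, index, name):
    """Seek straight to one record and return (header, seq, qual)."""
    handle.seek(index[name])
    lines = [handle.readline().decode("ascii").rstrip("\n") for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]

# In-memory example; a real file would be opened with mode "rb".
data = io.BytesIO(b"@r1\nACGT\n+\nhhhh\n@r2 extra\nTTGA\n+\nhhBB\n")
idx = build_index(data)
print(fetch(data, idx, "r2"))
```

An in-memory dict like this would not hold up for very large read
sets, so the offsets would have to live on disk, but no per-read
objects are stored or rebuilt on lookup.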

chris