[Bioperl-l] Next-gen modules

Wed Jun 17 16:35:38 EDT 2009

So, #1 priority is to get fastq up-to-speed, then maybe assess other  
options.

Illuminating discussion, thanks Elia!

urgh, excuse unintended bad pun above...

chris
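
For readers following the fastq speed point in the quoted thread below, here is a minimal sketch of the "return the data directly" approach: parse records into plain tuples and trim low-quality ends without building any sequence objects. It is written in Python purely for illustration and is not BioPerl code. The function names (read_fastq, trim_ends), the Phred+33 quality offset and the Q20 cutoff are all assumptions; Solexa files of this period often used a different offset and scale, so adjust accordingly.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) tuples from a FASTQ stream.

    Assumes the simple four-lines-per-record layout (no wrapped sequences).
    """
    lines = iter(handle)
    for header in lines:
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = next(lines, "").rstrip("\n")
        next(lines, "")  # the '+' separator line is ignored
        qual = next(lines, "").rstrip("\n")
        yield header.strip()[1:], seq, qual


def trim_ends(seq, qual, min_q=20, offset=33):
    """Trim bases whose quality is below min_q from both ends of a read.

    offset=33 is an assumption (Sanger-style encoding); it is a parameter
    so other encodings can be handled by the caller.
    """
    threshold = chr(min_q + offset)
    start, end = 0, len(qual)
    while start < end and qual[start] < threshold:
        start += 1
    while end > start and qual[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    demo = io.StringIO("@read1\nACGTACGT\n+\n!!IIII#!\n")
    for name, seq, qual in read_fastq(demo):
        print(name, *trim_ends(seq, qual))
```

Working on the raw quality string avoids per-base object overhead, which is where the thread suggests most of the time goes; the trade-off is that none of the richer Bio::Seq* functionality is available on the result.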

On Jun 17, 2009, at 3:06 PM, Elia Stupka wrote:

> Interesting that you mention the database issue. We found that for  
> specific memory/CPU intensive things we also switch to using dbs.  
> For example, after many years of loyal use of disconnected_ranges we  
> switched to a simple SQL implementation of it, because of the large  
> performance gains it would give us.  Similarly in Ensembl as well as  
> in the old days of bioperl-db we opted for doing subseq within SQL  
> where possible.
>
> Some lean way of SQL'izing specific components could be less  
> "disruptive" than avoiding object creation and provide significant  
> gains in performance. Could be set as an optional flag, and could  
> use temporary ad hoc SQL databases?
>
> Still, priority now is to make SeqIO compliant with all those  
> formats, then we can worry about performance :)
>
> Elia
>
> On 17 Jun 2009, at 20:30, Chris Fields wrote:
>
>> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>>
>>> Tristan Lefebure wrote:
>>>> Hello,
>>>> Regarding next-gen sequences and bioperl, following my  
>>>> experience, another issue is bioperl speed. For example, if you  
>>>> want to trim bad quality bases at ends of 1E6 Solexa reads using  
>>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,  
>>>> you've got to be patient (but maybe I missed some shortcuts...).
>>>
>>> This is my concern as well. Or, rather, is there actually a  
>>> significant set of users out there who are dealing with next-gen  
>>> sequencing and would consider using BioPerl for their work?
>>>
>>> I'm working with all the 1000-genomes data at the Sanger, and we  
>>> at least are probably never going to use BioPerl for the work.
>>
>> Are you using pure perl or (gasp) something else?  ;>
>>
>> Judging by the feedback there is definitely a set of users who  
>> would like to integrate nextgen into bioperl somehow, probably to  
>> take advantage of other aspects of bioperl.
>>
>>>> A pure perl solution will be between 100 and 1000x faster... Would  
>>>> it be possible to have an ultra-light quality object with few  
>>>> simple methods for next-gen reads?
>>>
>>> The fastq parser itself already seems pretty fast. The way to get  
>>> the speedup is to not create any Bio::Seq* objects but just return  
>>> the data directly. At that point it's not taking much advantage of  
>>> BioPerl. But certainly it could be done...
>>
>>
>> I suppose the best way to assess what needs to be done is come up  
>> with a set of 'use cases' specifying what users want so we can  
>> design around them, otherwise we're shooting in the dark.
>>
>> I'm personally wondering if this could be done as a sequence  
>> database, something similar in theme to Lincoln's  
>> SeqFeature::Store, but sequence only, and returns quality objects  
>> in a similar manner (ala Storable)?  Not sure whether that's  
>> feasible, but it appears at least scalable.
>>
>> chris
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> ---
> Senior Lecturer, Bioinformatics
> UCL Cancer Institute
> Paul O' Gorman Building
> University College London
> Gower Street
> WC1E 6BT
> London
> UK
>
> Office (UCL): +44 207 679 6493
> Office (ICMS): +44 0207 8822374
>
> Mobile: +44 7597 566 194
> Mobile (Italy): +39 338 8448801
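
As an illustration of the "temporary ad hoc SQL database" idea raised in the thread, the following sketch pushes two operations into SQL: extracting a subsequence, and merging overlapping ranges (loosely in the spirit of disconnected_ranges). It uses Python's sqlite3 module with an in-memory database and is not the Ensembl or bioperl-db implementation. The table names, column names and function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the merge query relies on window functions, which need SQLite 3.25 or later.

```python
import sqlite3


def open_scratch_db():
    """Create a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (name TEXT PRIMARY KEY, residues TEXT)")
    conn.execute(
        "CREATE TABLE feature_range (range_start INTEGER, range_end INTEGER)"
    )
    return conn


def subseq(conn, name, start, end):
    """Return residues start..end (1-based, inclusive) using SQL substr()."""
    row = conn.execute(
        "SELECT substr(residues, ?, ?) FROM sequence WHERE name = ?",
        (start, end - start + 1, name),
    ).fetchone()
    return row[0] if row else None


# Flag each range that does not overlap anything before it (in sorted order),
# number the resulting "islands" with a running sum, then collapse each island.
MERGE_SQL = """
WITH flagged AS (
    SELECT rowid AS rid, range_start, range_end,
           CASE WHEN range_start <= MAX(range_end) OVER (
                    ORDER BY range_start, range_end, rowid
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0 ELSE 1 END AS starts_island
    FROM feature_range
),
grouped AS (
    SELECT range_start, range_end,
           SUM(starts_island) OVER (
               ORDER BY range_start, range_end, rid
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS island
    FROM flagged
)
SELECT MIN(range_start), MAX(range_end)
FROM grouped
GROUP BY island
ORDER BY 1
"""


def merged_ranges(conn):
    """Return the minimal list of non-overlapping (start, end) ranges."""
    return conn.execute(MERGE_SQL).fetchall()
```

A short usage example: after inserting a sequence "ACGTACGTAC" under some name, subseq(conn, name, 3, 6) returns "GTAC"; after inserting the ranges (1,10), (5,20), (30,40), (35,50) and (60,70), merged_ranges(conn) returns [(1, 20), (30, 50), (60, 70)]. Whether this beats an in-memory implementation would depend on data size and indexing, so it should be benchmarked rather than assumed.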