[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Wed Jun 17 16:35:38 EDT 2009
So, #1 priority is to get fastq up to speed, then maybe assess other
options.
Illuminating discussion, thanks Elia!
urgh, excuse unintended bad pun above...
chris
On Jun 17, 2009, at 3:06 PM, Elia Stupka wrote:
> Interesting that you mention the database issue. We found that for
> specific memory/CPU-intensive things we also switch to using dbs.
> For example, after many years of loyal use of disconnected_ranges we
> switched to a simple SQL implementation of it, because of the large
> performance gains it would give us. Similarly in Ensembl as well as
> in the old days of bioperl-db we opted for doing subseq within SQL
> where possible.
>
> Some lean way of SQL'izing specific components could be less
> "disruptive" than avoiding object creation and provide significant
> gains in performance. Could be set as an optional flag, and could
> use temporary ad hoc SQL databases?
>
> Still, priority now is to make SeqIO compliant with all those
> formats, then we can worry about performance :)
>
> Elia
>
> On 17 Jun 2009, at 20:30, Chris Fields wrote:
>
>> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>>
>>> Tristan Lefebure wrote:
>>>> Hello,
>>>> Regarding next-gen sequences and bioperl, following my
>>>> experience, another issue is bioperl speed. For example, if you
>>>> want to trim bad quality bases at ends of 1E6 Solexa reads using
>>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,
>>>> you've got to be patient (but maybe I missed some shortcuts...).
>>>
>>> This is my concern as well. Or, rather, is there actually a
>>> significant set of users out there who are dealing with next-gen
>>> sequencing and would consider using BioPerl for their work?
>>>
>>> I'm working with all the 1000-genomes data at the Sanger, and we
>>> at least are probably never going to use BioPerl for the work.
>>
>> Are you using pure perl or (gasp) something else? ;>
>>
>> Judging by the feedback there is definitely a set of users who
>> would like to integrate nextgen into bioperl somehow, probably to
>> take advantage of other aspects of bioperl.
>>
>>>> A pure perl solution will be between 100 and 1000x faster... Would
>>>> it be possible to have an ultra-light quality object with few
>>>> simple methods for next-gen reads?
>>>
>>> The fastq parser itself already seems pretty fast. The way to get
>>> the speedup is to not create any Bio::Seq* objects but just return
>>> the data directly. At that point it's not taking much advantage of
>>> BioPerl. But certainly it could be done...
>>
>>
>> I suppose the best way to assess what needs to be done is to come up
>> with a set of 'use cases' specifying what users want so we can
>> design around them, otherwise we're shooting in the dark.
>>
>> I'm personally wondering if this could be done as a sequence
>> database, something similar in theme to Lincoln's
>> SeqFeature::Store, but sequence only, and returns quality objects
>> in a similar manner (a la Storable)? Not sure whether that's
>> feasible, but it appears at least scalable.
>>
>> chris
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> ---
> Senior Lecturer, Bioinformatics
> UCL Cancer Institute
> Paul O' Gorman Building
> University College London
> Gower Street
> WC1E 6BT
> London
> UK
>
> Office (UCL): +44 207 679 6493
> Office (ICMS): +44 0207 8822374
>
> Mobile: +44 7597 566 194
> Mobile (Italy): +39 338 8448801
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
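
To make the "return the data directly" idea from the thread concrete, here is a minimal sketch of an object-free FASTQ reader plus the end-trimming task Tristan describes. It is written in Python rather than Perl and is not BioPerl code. It assumes strict 4-line FASTQ records and Phred+33 quality encoding (Solexa files of this era often used a +64 offset and Solexa-scaled scores, so `offset` would need adjusting). The names `read_fastq`, `trim_ends`, `min_q` and `offset` are hypothetical, chosen only for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield plain (name, sequence, quality) tuples; no sequence objects are built.

    Assumes each record is exactly four lines: @name, sequence, +, quality.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def trim_ends(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Strip bases scoring below min_q from both ends of a read.

    offset=33 is an assumption (Sanger-style encoding); low-quality bases
    in the middle of the read are left untouched.
    """
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


# Small in-memory example: '#' scores 2, 'I' scores 40, '!' scores 0.
demo = io.StringIO("@read1\nACGTACGT\n+\n##IIII#!\n")
for name, seq, qual in read_fastq(demo):
    print(name, *trim_ends(seq, qual))  # prints: read1 GTAC IIII
```

The reader hands back strings only, which is where the speedup discussed above would come from. The trade-off is the one Sendu points out: nothing here takes advantage of the rest of the toolkit.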