[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Wed Jun 17 18:38:26 EDT 2009
On Jun 17, 2009, at 5:10 PM, Sendu Bala wrote:
> Chris Fields wrote:
>> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>>> Tristan Lefebure wrote:
>>>> Hello,
>>>> Regarding next-gen sequences and bioperl, following my
>>>> experience, another issue is bioperl speed. For example, if you
>>>> want to trim bad quality bases at ends of 1E6 Solexa reads using
>>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,
>>>> you've got to be patient (but maybe I missed some shortcuts...).
>>>
>>> This is my concern as well. Or, rather, is there actually a
>>> significant set of users out there who are dealing with next-gen
>>> sequencing and would consider using BioPerl for their work?
>>>
>>> I'm working with all the 1000-genomes data at the Sanger, and we
>>> at least are probably never going to use BioPerl for the work.
>> Are you using pure perl or (gasp) something else? ;>
>
> We use some perl stuff, some C stuff. My own stuff is OO perl, but
> much lighter weight than BioPerl. Absolute minimal object creation.
Makes sense.
>>>> A pure perl solution will be between 100 and 1000x faster... Would
>>>> it be possible to have an ultra-light quality object with few
>>>> simple methods for next-gen reads?
>>>
>>> The fastq parser itself already seems pretty fast. The way to get
>>> the speedup is to not create any Bio::Seq* objects but just return
>>> the data directly. At that point it's not taking much advantage of
>>> BioPerl. But certainly it could be done...
>> I suppose the best way to assess what needs to be done is to come up
>> with a set of 'use cases' specifying what users want so we can
>> design around them, otherwise we're shooting in the dark.
>
> Indeed. Though at least I think we can all agree it would be nice to
> have the functionality there even if it's slow. There will always be
> at least some use-cases where the run speed doesn't matter.
Agreed.
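
As a rough illustration of the lightweight approach discussed above (parse the records and hand back plain data, with no Bio::Seq* object per read), here is a minimal sketch. It is written in Python purely for illustration, not taken from BioPerl or from anyone's pipeline. The names read_fastq and trim_ends are hypothetical, and it assumes strict four-line FASTQ records and Sanger-style Phred+33 quality encoding (older Solexa files use a different offset and scale, so min_q and offset would need adjusting).

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) tuples.

    Assumes exactly four lines per record; no per-read objects are built.
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate stray blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def trim_ends(seq, qual, min_q=20, offset=33):
    """Trim bases scoring below min_q from both ends of a read.

    offset=33 assumes Phred+33 encoding (an assumption, see above).
    """
    start, end = 0, len(qual)
    while start < end and ord(qual[start]) - offset < min_q:
        start += 1
    while end > start and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[start:end], qual[start:end]

if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGTACGT\n+\n##IIII##\n")
    for name, seq, qual in read_fastq(demo):
        print(name, *trim_ends(seq, qual))
```

The point of the design is that the per-read cost is a few string operations and one tuple, which is where the speedup over full object construction would be expected to come from.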
>> I'm personally wondering if this could be done as a sequence
>> database, something similar in theme to Lincoln's
>> SeqFeature::Store, but sequence only, and returns quality objects
>> in a similar manner (à la Storable)? Not sure whether that's
>> feasible, but it appears at least scalable.
>
> I think not. Well, at least SeqFeature::Store doesn't scale. Try
> storing millions of features in a database and watch it crawl to
> complete unusability. I can't imagine a db scaling to holding
> hundreds of TB of data either. I'm also not sure what the benefit
> is. There are already high-speed ways of indexing your fastq or bam
> files.
Interesting that you ran into issues with SF::Store; wonder if object
storage is the limiting factor there, or if it is something else.
Anyone else having this issue?
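
On the flat-file indexing point quoted above, one simple scheme (not necessarily what is used at the Sanger) is to record the byte offset of each record once and then seek straight to it, so no database is involved. The sketch below is in Python for illustration only. build_offset_index and fetch are hypothetical names, and it assumes four-line FASTQ records, unique read names, and a file opened in binary mode.

```python
import io

def build_offset_index(handle):
    """Return {read_name: byte_offset} for a binary FASTQ handle.

    Assumes four lines per record and unique read names.
    """
    index = {}
    while True:
        pos = handle.tell()
        header = handle.readline()
        if not header:          # end of file
            break
        if not header.strip():  # skip stray blank lines
            continue
        # Use the first whitespace-delimited token after '@' as the key.
        name = header[1:].split()[0].decode()
        index[name] = pos
        for _ in range(3):      # skip sequence, '+', and quality lines
            handle.readline()
    return index

def fetch(handle, index, name):
    """Seek to one record and return (header, sequence, quality)."""
    handle.seek(index[name])
    lines = [handle.readline().decode().rstrip("\n") for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]

if __name__ == "__main__":
    fh = io.BytesIO(b"@r1 extra\nACGT\n+\nIIII\n@r2\nGGCC\n+\n####\n")
    idx = build_offset_index(fh)
    print(fetch(fh, idx, "r2"))
```

For real data the index would be written to disk rather than held in a dict, but the lookup cost stays one seek plus four line reads per record.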
chris