[Bioperl-l] Next-gen modules

Wed Jun 17 18:39:08 UTC 2009

We are using Bioperl for simple pre- and post-processing of data for
full Solexa runs, and although it might not be ideal, the scripting
with Bioperl is not a major killer. When I was referring to large,
heavy pipelines I was thinking of pipelines that deal with many Solexa
runs as one project (e.g. 1000 genomes), which really cannot afford any
bottleneck, because that directly affects their
storage.

cheers

Elia

On 17 Jun 2009, at 19:09, Tristan Lefebure wrote:

> Thanks both for the light.
>
> That probably means that the place bioperl will take in the
> handling of next-gen sequencing raw data (i.e. reads) is
> very limited, no? (At least until bioperl6.) A single GA2
> Solexa lane generates about 9 million reads, and I would
> really not call that a big project...
>
> BTW, is there a simple way to see object instantiation and
> inheritance, as well as the time consumed by each, when one
> calls next_seq() (or any other method)?
>
> -Tristan
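
On the question above about seeing inheritance and per-call cost: a minimal sketch, using only core modules (mro, Time::HiRes), of how one might list a class's method resolution order and time a constructor against a plain hash. The My::* classes are hypothetical stand-ins for a deep class tree, not BioPerl classes, and the timings will vary by machine. For a full per-subroutine profile, the CPAN profiler Devel::NYTProf (run as `perl -d:NYTProf script.pl`, then `nytprofhtml`) is probably the more practical tool.

```perl
use strict;
use warnings;
use mro;                                         # core since Perl 5.10
use Time::HiRes qw(gettimeofday tv_interval);    # core

# Hypothetical three-level hierarchy (an assumption for illustration only).
{
    package My::Root;
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
}
{
    package My::Seq;
    use parent -norequire, 'My::Root';
    sub seq { return $_[0]{seq} }
}
{
    package My::Qual;
    use parent -norequire, 'My::Seq';
    sub qual { return $_[0]{qual} }
}

# Return the method resolution order of a class name or object as a list.
sub linear_isa {
    my ($thing) = @_;
    my $class = ref($thing) || $thing;
    return @{ mro::get_linear_isa($class) };
}

# Run a code reference $n times; return elapsed wall-clock seconds.
sub time_calls {
    my ($n, $code) = @_;
    my $t0 = [gettimeofday];
    $code->() for 1 .. $n;
    return tv_interval($t0);
}

my $obj = My::Qual->new(seq => 'ACGT', qual => 'IIII');
print join(' -> ', linear_isa($obj)), "\n";   # My::Qual -> My::Seq -> My::Root

printf "object construction: %.3f s\n",
    time_calls(100_000, sub { My::Qual->new(seq => 'ACGT', qual => 'IIII') });
printf "plain hash:          %.3f s\n",
    time_calls(100_000, sub { my $h = { seq => 'ACGT', qual => 'IIII' }; return });
```

Passing a real class name (for example the class of the object returned by next_seq()) to linear_isa() would show how many parents a method lookup may have to walk.
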
>
> On Wednesday 17 June 2009 13:49:38 Elia Stupka wrote:
>> I would suggest developing the "standard" version first,
>> then moving onto potential optimizations.
>>
>> When we went through a similar argument in Ensembl about
>> 8 years ago we ended up dropping Bio::Root completely...
>>
>> If one is truly after performance for these large
>> next-gen projects, it'd be down to pure piping, shell,
>> and worrying about location and copying of files,
>> sticking to systems-level as much as possible, and quite
>> far from Bioperl altogether, so I think it's a whole
>> different level of optimization issues, probably outside
>> the scope of Bioperl.
>>
>> Elia
>>
>> On 17 Jun 2009, at 18:09, Chris Fields wrote:
>>> On Jun 17, 2009, at 8:27 AM, Tristan Lefebure wrote:
>>>> Hello,
>>>> Regarding next-gen sequences and bioperl, in my
>>>> experience, another issue is bioperl speed. For
>>>> example, if you want to trim bad quality bases at the ends
>>>> of 1E6 Solexa reads using Bio::SeqIO::fastq and some
>>>> methods in Bio::Seq::Quality, well, you've got to be
>>>> patient (but maybe I missed some shortcuts...).
>>>
>>> The key issues affecting speed in bioperl are contained
>>> object instantiation and inheritance (and between those
>>> two, the latter much more so as it plays a role with
>>> contained objects as well as the container).
>>>
>>> http://www.bioperl.org/wiki/Why_BioPerl_is_slow
>>>
>>> Moose/Perl6 roles/traits are one way around that issue,
>>> but we are a ways off from getting that running.  I
>>> think to get that working decently would be a
>>> from-ground-up endeavor (see my past posts on
>>> biomoose/bioperl6).
>>>
>>>> A pure perl solution will be between 100 and 1000x
>>>> faster... Would it be possible to have an ultra-light
>>>> quality object with a few simple methods for next-gen
>>>> reads?
>>>>
>>>> I can contribute some tests if that sounds like an
>>>> important point.
>>>>
>>>> -Tristan
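
For reference, a rough sketch of what a pure-Perl end trimmer of the kind mentioned above might look like. It is not the code Tristan used. It assumes strict four-line FASTQ records, a quality threshold chosen by the caller, and an ASCII quality offset that defaults to 33 (older Solexa/Illumina files may need 64); the function names trim_ends and trim_fastq are made up for this example.

```perl
use strict;
use warnings;

# Trim bases below $min_q from both ends of one read.
# $offset is the ASCII offset of the quality encoding (assumed default: 33).
sub trim_ends {
    my ($seq, $qual, $min_q, $offset) = @_;
    $offset = 33 unless defined $offset;
    my @q = map { ord($_) - $offset } split //, $qual;
    my ($start, $end) = (0, $#q);
    $start++ while $start <= $end && $q[$start] < $min_q;
    $end--   while $end >= $start && $q[$end]   < $min_q;
    my $len = $end - $start + 1;
    return ('', '') if $len <= 0;              # nothing left after trimming
    return (substr($seq, $start, $len), substr($qual, $start, $len));
}

# Read four-line FASTQ records from $in, write trimmed records to $out.
# Returns the number of reads kept; reads trimmed to zero length are dropped.
sub trim_fastq {
    my ($in, $out, $min_q, $offset) = @_;
    my $kept = 0;
    while (defined(my $head = <$in>)) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        last unless defined $qual;             # truncated final record: stop
        chomp($head, $seq, $plus, $qual);
        my ($tseq, $tqual) = trim_ends($seq, $qual, $min_q, $offset);
        next unless length $tseq;
        print {$out} "$head\n$tseq\n+\n$tqual\n";
        $kept++;
    }
    return $kept;
}

# Small in-memory demonstration ('#' is Q2 and 'I' is Q40 at offset 33).
my $fastq = "\@read1\nACGTACGT\n+\n##IIII##\n";
open my $in, '<', \$fastq or die "cannot open in-memory FASTQ: $!";
trim_fastq($in, \*STDOUT, 20);                 # prints the read trimmed to GTAC / IIII
close $in;
```

No objects are created per read here, which is the point of the comparison; the price is that nothing validates the records or handles multi-line FASTQ.
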
>>>
>>> The quality objects themselves I don't think are that
>>> heavy; I think the main impediment is inheritance.  One
>>> could get around that a bit by using a direct_new
>>> method to create a blessed hash directly, then
>>> reimplementing methods to lazily create any contained
>>> objects on the fly.
>>>
>>> chris
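
The direct_new idea described above could be sketched roughly as follows. This is an illustration only, not BioPerl code: My::LightQual and its fields (raw_qual, offset, _qual) are hypothetical names. The constructor blesses the argument hash without chaining up any inheritance tree, and the numeric quality array is built only when qual() is first called.

```perl
use strict;
use warnings;

{
    package My::LightQual;    # hypothetical class, not part of BioPerl

    # Bless the argument hash directly: no parent constructors are called
    # and no argument processing is done.
    sub direct_new {
        my ($class, %args) = @_;
        return bless {%args}, $class;
    }

    sub seq { return $_[0]{seq} }

    # Build the numeric quality array on first access, then cache it.
    sub qual {
        my ($self) = @_;
        $self->{_qual} ||= [
            map { ord($_) - $self->{offset} } split //, $self->{raw_qual}
        ];
        return $self->{_qual};
    }
}

my $read = My::LightQual->direct_new(
    seq      => 'ACGT',
    raw_qual => 'II#I',
    offset   => 33,           # assumed quality encoding offset
);
print join(' ', @{ $read->qual }), "\n";   # 40 40 2 40
```

Reads whose qualities are never inspected never pay for the conversion, which is where the lazy part helps when streaming millions of records.
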
>>
>> ---
>> Senior Lecturer, Bioinformatics
>> UCL Cancer Institute
>> Paul O' Gorman Building
>> University College London
>> Gower Street
>> WC1E 6BT
>> London
>> UK
>>
>> Office (UCL): +44 207 679 6493
>> Office (ICMS): +44 0207 8822374
>>
>> Mobile: +44 7597 566 194
>> Mobile (Italy): +39 338 8448801
>
>
