[Bioperl-l] Next-gen modules

Elia Stupka e.stupka at ucl.ac.uk
Sat Jun 20 16:12:18 EDT 2009


Hi Chris,

I agree. I have not written a single line of code so far, while Heikki  
has some (but has been silent for a while) and you have perhaps some  
code ready to roll. I am happy to help where needed, just let me know  
what you'd like me to focus on. If you want to go ahead and implement  
the fastq stuff discussed, I can focus on bioperl-run.

cheers

Elia



On 19 Jun 2009, at 21:57, Chris Fields wrote:

> So, to follow up (and make sure we don't have any overlapping tuits)  
> we should probably determine who wants to work on what (i.e. fastq  
> updating, etc). I think it's possible to quickly add in Solexa/ 
> Illumina/Sanger fastq similar to BioPython, just don't want to step  
> on anyone's toes if they are halfway through doing this.
>
> chris
>
> On Jun 17, 2009, at 3:36 PM, Elia Stupka wrote:
>
>> Better than colorspaced discussions for sure ;)
>>
>> Elia
>>
>> On 17 Jun 2009, at 21:35, Chris Fields wrote:
>>
>>> So, #1 priority is to get fastq up-to-speed, then maybe assess  
>>> other options.
>>>
>>> Illuminating discussion, thanks Elia!
>>>
>>> urgh, excuse unintended bad pun above...
>>>
>>> chris
>>>
>>> On Jun 17, 2009, at 3:06 PM, Elia Stupka wrote:
>>>
>>>> Interesting that you mention the database issue. We found that  
>>>> for specific memory/CPU intensive things we also switch to using  
>>>> dbs. For example, after many years of loyal use of  
>>>> disconnected_ranges we switched to a simple SQL implementation of  
>>>> it, because of the large performance gains it would give us.   
>>>> Similarly in Ensembl as well as in the old days of bioperl-db we  
>>>> opted for doing subseq within SQL where possible.
>>>>
>>>> Some lean way of SQL'izing specific components could be less  
>>>> "disruptive" than avoiding object creation and provide  
>>>> significant gains in performance. Could be set as an optional  
>>>> flag, and could use temporary ad hoc SQL databases?
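[Editorial sketch, not part of the original thread.] The idea above of pushing subseq into a temporary, ad hoc SQL database can be illustrated roughly as follows. This is a minimal sketch in Python using the standard sqlite3 module with an in-memory database; the table name `seq`, the column names, and the helpers `build_store` and `subseq` are hypothetical and do not reflect the bioperl-db or Ensembl schemas.

```python
import sqlite3


def build_store(records):
    """Load (name, sequence) pairs into a temporary in-memory SQLite database.

    The schema here is an assumption made for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seq (name TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO seq (name, residues) VALUES (?, ?)", records)
    conn.commit()
    return conn


def subseq(conn, name, start, end):
    """Return residues start..end (1-based, inclusive), sliced inside SQL.

    Only the requested slice is returned to the caller; no sequence
    object is built. Returns None if the name is unknown.
    """
    row = conn.execute(
        "SELECT substr(residues, ?, ?) FROM seq WHERE name = ?",
        (start, end - start + 1, name),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    store = build_store([("demo_read", "ACGTACGTAC")])
    print(subseq(store, "demo_read", 3, 6))
```

Whether this beats in-memory objects depends on sequence length and access pattern, so it would need benchmarking against real data before drawing conclusions.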
>>>>
>>>> Still, priority now is to make SeqIO compliant with all those  
>>>> formats, then we can worry about performance :)
>>>>
>>>> Elia
>>>>
>>>> On 17 Jun 2009, at 20:30, Chris Fields wrote:
>>>>
>>>>> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>>>>>
>>>>>> Tristan Lefebure wrote:
>>>>>>> Hello,
>>>>>>> Regarding next-gen sequences and bioperl, following my  
>>>>>>> experience, another issue is bioperl speed. For example, if  
>>>>>>> you want to trim bad quality bases at ends of 1E6 Solexa reads  
>>>>>>> using Bio::SeqIO::fastq and some methods in Bio::Seq::Quality,  
>>>>>>> well, you've got to be patient (but maybe I missed some  
>>>>>>> shortcuts...).
>>>>>>
>>>>>> This is my concern as well. Or, rather, is there actually a  
>>>>>> significant set of users out there who are dealing with next- 
>>>>>> gen sequencing and would consider using BioPerl for their work?
>>>>>>
>>>>>> I'm working with all the 1000-genomes data at the Sanger, and  
>>>>>> we at least are probably never going to use BioPerl for the work.
>>>>>
>>>>> Are you using pure perl or (gasp) something else?  ;>
>>>>>
>>>>> Judging by the feedback there is definitely a set of users who  
>>>>> would like to integrate nextgen into bioperl somehow, probably  
>>>>> to take advantage of other aspects of bioperl.
>>>>>
>>>>>>> A pure perl solution will be between 100 and 1000x faster...  
>>>>>>> Would it be possible to have an ultra-light quality object  
>>>>>>> with few simple methods for next-gen reads?
>>>>>>
>>>>>> The fastq parser itself already seems pretty fast. The way to  
>>>>>> get the speedup is to not create any Bio::Seq* objects but just  
>>>>>> return the data directly. At that point it's not taking much  
>>>>>> advantage of BioPerl. But certainly it could be done...
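[Editorial sketch, not part of the original thread.] The "return the data directly" approach described above, combined with the end-trimming use case raised earlier, might look roughly like this. It is a minimal sketch in Python rather than Perl, and it is not the Bio::SeqIO::fastq implementation. The function names `read_fastq` and `trim_ends` are hypothetical. It assumes strict four-line FASTQ records (no wrapped sequence lines) and Sanger-style quality encoding with ASCII offset 33; Solexa and older Illumina files use offset 64, and Solexa scores are on a different scale, so `offset` and `min_q` would need adjusting for those.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality_string) tuples, one per record.

    Assumes four lines per record; no per-read objects are created.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def trim_ends(seq: str, qual: str, min_q: int = 20,
              offset: int = 33) -> Tuple[str, str]:
    """Strip bases with quality below min_q from both ends of a read."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGTACGT\n+\n!!IIII!!\n")
    for name, seq, qual in read_fastq(demo):
        print(name, *trim_ends(seq, qual))
```

The trade-off is the one noted in the thread: working with plain tuples is light, but it gives up the rest of the Bio::Seq* interface.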
>>>>>
>>>>>
>>>>> I suppose the best way to assess what needs to be done is come  
>>>>> up with a set of 'use cases' specifying what users want so we  
>>>>> can design around them, otherwise we're shooting in the dark.
>>>>>
>>>>> I'm personally wondering if this could be done as a sequence  
>>>>> database, something similar in theme to Lincoln's  
>>>>> SeqFeature::Store, but sequence only, and returns quality  
>>>>> objects in a similar manner (ala Storable)?  Not sure whether  
>>>>> that's feasible, but it appears at least scalable.
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> ---
>>>> Senior Lecturer, Bioinformatics
>>>> UCL Cancer Institute
>>>> Paul O' Gorman Building
>>>> University College London
>>>> Gower Street
>>>> WC1E 6BT
>>>> London
>>>> UK
>>>>
>>>> Office (UCL): +44 207 679 6493
>>>> Office (ICMS): +44 0207 8822374
>>>>
>>>> Mobile: +44 7597 566 194
>>>> Mobile (Italy): +39 338 8448801
>>>
>>
>

---
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street
WC1E 6BT
London
UK

Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374

Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801


