[Bioperl-l] Next-gen modules

Wed Jun 17 17:52:49 UTC 2009

I think this is a top priority for a fall BioPerl release, maybe 1.6.2
(I am still planning on a summer 1.6.1 release). I have filed it as a bug
report for tracking:

http://bugzilla.open-bio.org/show_bug.cgi?id=2857

If no one works on this I may take it up after the 1.6.1 release.

chris

On Jun 17, 2009, at 12:13 PM, Mark A. Jensen wrote:

> I'm on the case! (but maybe not in realtime, today!)
>
> ----- Original Message -----
> From: "Chris Fields" <cjfields at illinois.edu>
> To: "Peter" <biopython at maubp.freeserve.co.uk>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka" <e.stupka at ucl.ac.uk>; "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
> Sent: Wednesday, June 17, 2009 1:06 PM
> Subject: Re: [Bioperl-l] Next-gen modules
>
>
>>
>> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>>
>>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>>
>>>> Elia,
>>>>
>>>> As Mark indicated, we recently discussed the lack of support for next-gen
>>>> on list, at least re: fastq. I may be hit with the same thing in a few
>>>> months time myself, and I recall Jason and a few others also mentioning
>>>> the same. Heikki wrote some code for Illumina FASTQ for SeqIO and related
>>>> modules but I don't believe it has been committed to trunk yet, so maybe
>>>> he can answer.
>>>>
>>>> From prior discussions IIRC the issues were:
>>>>
>>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>>> Illumina 1.3) from one another (so maybe some optional validation), and
>>>
>>> Following the Python rule of thumb for being explicit, Biopython makes
>>> the user specify which FASTQ variant is being used. I don't think you
>>> can do anything else. Any attempted validation would have to be
>>> heuristic, based on the ASCII characters found, and would risk false
>>> positive warnings.
>>
>> Right; I'm thinking along the same lines. If anything, the most we would
>> allow is some level of validation, so if there were a degree of
>> uncertainty about the format one could set a validation flag to check
>> bounds during the parse and warn if they are exceeded.
>>
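The bounds check described above could look roughly like the sketch below. It is written in Python for brevity and is not BioPerl code; the variant labels, the `QUAL_BOUNDS` table and the `check_quality_bounds` helper are hypothetical names. The ASCII ranges are the commonly documented ones for each variant (Sanger offset 33, Solexa and Illumina 1.3 offset 64).

```python
import warnings

# ASCII bounds for the encoded quality characters of each FASTQ variant.
# The variant labels are illustrative, not identifiers from BioPerl.
QUAL_BOUNDS = {
    "sanger": (33, 126),    # PHRED 0..93, offset 33
    "solexa": (59, 126),    # Solexa -5..62, offset 64
    "illumina": (64, 126),  # PHRED 0..62, offset 64 (Illumina 1.3)
}


def check_quality_bounds(qual, variant, validate=True):
    """Warn and return False if qual has characters out of range for variant."""
    if not validate:
        # Validation is optional: the caller has declared the variant.
        return True
    low, high = QUAL_BOUNDS[variant]
    bad = sorted({c for c in qual if not low <= ord(c) <= high})
    if bad:
        warnings.warn(
            f"quality characters {bad!r} are out of range for {variant!r}"
        )
        return False
    return True
```

A passing check does not prove that the declared variant is right, since the ranges overlap. It only catches data that cannot belong to that variant.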
>>>> 2) having a way for the Seq object to either 'know' what format is
>>>> contained, or we use phred score and convert back and forth from that
>>>> (I think the latter makes more sense).
>>>
>>> I think it could make sense for BioPerl to convert Solexa scores to/from
>>> PHRED scores on the fly (especially now that Illumina is abandoning
>>> the Solexa score system). Python style tries to avoid implicit conversions,
>>> so Biopython doesn't automatically do a conversion from Solexa to
>>> PHRED scores on parsing (but will on writing if the requested output
>>> format requires this).
>>>
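For reference, the Solexa/PHRED conversion mentioned above can be sketched as follows, using the standard relations Q_phred = 10*log10(10^(Q_solexa/10) + 1) and its inverse. The function names are hypothetical, and flooring the Solexa score at -5 is an assumption taken from the usual lower bound of that scale.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED score to a Solexa score, floored at -5 (assumption)."""
    if q_phred <= 0:
        # The inverse formula is undefined at PHRED 0, so use the floor.
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales agree closely for high scores and diverge only at the low end, so rounding to integers after conversion loses little for good-quality bases.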
>>>> Peter's suggestions also are reasonable, though does Biopython have a
>>>> separate module for each of these variations? Our version (I believe)
>>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
>>>> fastq variant passed in as a separate named argument.
>>>
>>> Biopython's SeqIO gives the three FASTQ variants their own unique
>>> names. This format name is a required argument for parsing/writing
>>> (we don't try to guess the file format from the data contents).
>>> Internally we have three separate FASTQ parsers/writers although they do
>>> share code.
>>
>> We could easily do the same if others agree. Actually, if we specified
>> that shorthand for a variant on a format would be designated as
>> -format => 'format-variant', I think we could easily hack SeqIO to deal
>> with that by splitting on '-' and passing everything to the constructor
>> as (-format => 'format', -variant => 'variant'). Very little repeated
>> code in this case, just an additional named parameter indicating the
>> format variant (and the SeqIO class can do the type checking on that
>> within the constructor).
>>
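The 'format-variant' splitting idea can be illustrated with a small sketch. This is Python rather than a patch to SeqIO, and `split_format` and `KNOWN_VARIANTS` are hypothetical names used only to show the shape of the logic: split once on '-', then check the variant against what the format supports.

```python
# Variants each format accepts; illustrative values only.
KNOWN_VARIANTS = {
    "fastq": {"sanger", "solexa", "illumina"},
}


def split_format(spec, default_variant=None):
    """Split 'format-variant' into separate format and variant arguments."""
    fmt, sep, variant = spec.partition("-")
    fmt = fmt.lower()
    variant = variant.lower() if sep else default_variant
    allowed = KNOWN_VARIANTS.get(fmt)
    if variant is not None and (allowed is None or variant not in allowed):
        # The type checking that the constructor would otherwise perform.
        raise ValueError(f"unknown variant {variant!r} for format {fmt!r}")
    return {"format": fmt, "variant": variant}
```

Splitting only on the first '-' leaves room for variant names that themselves contain a hyphen.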
>>> Other issues to keep in mind:
>>>
>>> (3) There should be no warning parsing files where the optional repeated
>>> title is missing on the "+" lines (as discussed earlier on the BioPerl
>>> list).
>>
>> Agreed, though we'll have to check the current fastq parser to see if
>> that's currently the case. I thought that was fixed but maybe not?
>>
>>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>>> title on the "+" line? Biopython omits this as I understand this to be
>>> common practice, and it can make a big difference to file sizes,
>>> especially on short read data from Solexa/Illumina.
>>
>> Agreed, particularly if it's commonly encountered.
>>
>>> (5) Also test reading and writing files with an optional description
>>> (as well as an identifier) on the "@" (and "+") lines. See the NCBI SRA
>>> for examples, e.g.
>>>
>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>
>> Should be easy enough to implement with a simple regex.
>>
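One possible shape for that regex, sketched in Python: take everything up to the first whitespace as the identifier and the remainder as the description. `HEADER_RE` and `parse_title` are hypothetical names, and a bare "+" line is treated as having an empty identifier.

```python
import re

# '@' or '+', then an identifier (no whitespace), then an optional description.
HEADER_RE = re.compile(r"^[@+](\S*)(?:\s+(.*))?$")


def parse_title(line):
    """Return (identifier, description) from an '@' or '+' title line."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a FASTQ title line: {line!r}")
    # The description group is None when only an identifier is present.
    return match.group(1), match.group(2) or ""
```

Applied to the SRA example above, this gives the identifier `SRR001666.1` and the description `071112_SLXA-EAS1_s_7:5:1:817:345 length=36`.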
>>> (6) Test reading and writing files where the encoded quality string
>>> starts with a "@" or a "+" character, e.g.
>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>>>
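One way to cope with quality strings that begin with "@" or "+" is to stop looking at the first character once the "+" line has been seen, and read quality text until it is as long as the sequence. The Python sketch below illustrates that approach; `read_fastq` is a hypothetical helper, not the Biopython or BioPerl parser.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines."""
    it = (ln.rstrip("\r\n") for ln in lines)
    for title in it:
        if not title:
            continue  # tolerate blank lines between records
        if not title.startswith("@"):
            raise ValueError(f"expected '@' title line, got {title!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for ln in it:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        else:
            raise ValueError("missing '+' line")
        seq = "".join(seq_parts)
        # Read quality by length, so a leading '@' or '+' is just data.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(it)
            except StopIteration:
                raise ValueError("truncated quality string") from None
        if len(qual) != len(seq):
            raise ValueError("sequence and quality lengths differ")
        yield title[1:], seq, qual
```

Because the quality is consumed by length, a quality line such as `@+II` is never mistaken for the start of the next record.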
>>> Peter
>>
>> Mark, getting all that? ;>
>>
>> chris
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>