[Bioperl-l] Next-gen modules

Wed Jun 17 17:13:23 UTC 2009

I'm on the case! (but maybe not in realtime, today!)

----- Original Message ----- 
From: "Chris Fields" <cjfields at illinois.edu>
To: "Peter" <biopython at maubp.freeserve.co.uk>
Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka" 
<e.stupka at ucl.ac.uk>; "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
Sent: Wednesday, June 17, 2009 1:06 PM
Subject: Re: [Bioperl-l] Next-gen modules

>
> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>
>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>
>>> Elia,
>>>
>>> As Mark indicated, we recently discussed on list the lack of support for
>>> next-gen data, at least re: FASTQ. I may be hit with the same thing in a
>>> few months' time myself, and I recall Jason and a few others also
>>> mentioning the same. Heikki wrote some code for Illumina FASTQ for SeqIO
>>> and related modules, but I don't believe it has been committed to trunk
>>> yet, so maybe he can answer.
>>>
>>> From prior discussions IIRC the issues were:
>>>
>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>> Illumina 1.3) from one another (so maybe some optional validation), and
>>
>> Following the python rule of thumb for being explicit, Biopython makes
>> the user specify which FASTQ variant is being used. I don't think you
>> can do anything else. Any attempted validation would have to be
>> heuristic based on the ASCII characters found, and would risk false
>> positive warnings.
>
> Right; I'm thinking along the same lines. If anything, the most we would
> allow is some level of validation, so if there were a degree of uncertainty
> about the format one could set a validation flag to check bounds during the
> parse and warn if they are exceeded.
>
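As a rough illustration of the optional bounds check described above, here is
a minimal Python sketch. The variant labels and the function name
check_quality_bounds are hypothetical, not part of BioPerl or Biopython; the
ASCII ranges are the commonly cited ones for each variant (Sanger: offset 33,
PHRED 0 to 93; Solexa: offset 64, scores -5 to 62; Illumina 1.3: offset 64,
PHRED 0 to 62).

```python
import warnings

# Lowest and highest allowed ASCII codes per variant.
# The variant names are illustrative labels, not an existing API.
QUALITY_BOUNDS = {
    "sanger": (33, 126),    # PHRED 0..93, ASCII offset 33
    "solexa": (59, 126),    # Solexa -5..62, ASCII offset 64
    "illumina": (64, 126),  # PHRED 0..62, ASCII offset 64
}


def check_quality_bounds(quality, variant):
    """Return True if every character is in range for the declared variant.

    Issues a single warning (and returns False) when any character falls
    outside the expected range, rather than aborting the parse.
    """
    low, high = QUALITY_BOUNDS[variant]
    bad = sorted({ch for ch in quality if not low <= ord(ch) <= high})
    if bad:
        warnings.warn(
            f"{len(bad)} distinct quality character(s) outside the "
            f"{variant} range: {''.join(bad)!r}"
        )
        return False
    return True
```

Note that a check like this can only flag characters that are impossible for
the declared variant; a file that passes is not thereby proven to be that
variant, since the ranges overlap heavily.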
>>> 2) having a way for the Seq object to either 'know' what format is
>>> contained, or we use phred score and convert back and forth from that (I
>>> think the latter makes more sense).
>>
>> I think it could make sense for BioPerl to convert Solexa scores to/from
>> PHRED scores on the fly (especially now that Illumina is abandoning
>> the Solexa score system). Python style tries to avoid implicit conversions,
>> so Biopython doesn't automatically do a conversion from Solexa to
>> PHRED scores on parsing (but will on writing if the requested output
>> format requires this).
>>
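For reference, the Solexa/PHRED conversion mentioned above can be sketched in
Python as follows. The function names are hypothetical; the formulas are the
standard relationship between the two scales (PHRED is -10*log10(p), Solexa is
-10*log10(p/(1-p)) for error probability p).

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED quality score to the equivalent Solexa score.

    PHRED 0 means an error probability of 1, which has no finite Solexa
    equivalent, so this sketch rejects non-positive input instead of
    choosing a clamping policy.
    """
    if q_phred <= 0:
        raise ValueError("PHRED score must be positive to convert to Solexa")
    return 10 * math.log10(10 ** (q_phred / 10) - 1)
```

The two scales are nearly identical for high-quality bases and diverge only at
the low end, which is where a round trip through integer rounding would lose
information.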
>>> Peter's suggestions also are reasonable, though does Biopython have a
>>> separate module for each of these variations? Our version (I believe)
>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
>>> fastq variant passed in as a separate named argument.
>>
>> Biopython's SeqIO gives the three FASTQ variants their own unique
>> names. This format name is a required argument for parsing/writing
>> (we don't try to guess the file format from the data contents). Internally
>> we have three separate FASTQ parsers/writers, although they do share
>> code.
>
> We could easily do the same if others agree. Actually, if we specified that
> shorthand for a variant on a format would be designated as -format =>
> 'format-variant', I think we could easily hack SeqIO to deal with that by
> splitting on '-' and passing everything to the constructor as (-format =>
> 'format', -variant => 'variant'). Very little repeated code in this case,
> just an additional named parameter indicating the format variant (and the
> SeqIO class can do the type checking on that within the constructor).
>
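The 'format-variant' splitting idea above might look roughly like this,
sketched in Python rather than Perl. Everything here is hypothetical: the
function name split_format, the KNOWN_VARIANTS table, and the variant names
are placeholders, since the thread does not settle on a naming scheme.

```python
# Hypothetical table of accepted variants per format.
KNOWN_VARIANTS = {"fastq": {"sanger", "solexa", "illumina"}}


def split_format(spec):
    """Split 'format-variant' into separate format and variant values.

    A spec with no '-' yields variant None. An unknown variant raises
    ValueError, standing in for the type checking a constructor would do.
    """
    fmt, sep, variant = spec.lower().partition("-")
    if not sep:
        return {"format": fmt, "variant": None}
    if variant not in KNOWN_VARIANTS.get(fmt, set()):
        raise ValueError(f"unknown variant {variant!r} for format {fmt!r}")
    return {"format": fmt, "variant": variant}
```

Splitting only on the first '-' keeps the door open for variant names that
themselves contain a hyphen.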
>> Other issues to keep in mind:
>>
>> (3) There should be no warning when parsing files where the optional
>> repeated title is missing on the "+" lines (as discussed earlier on the
>> BioPerl list).
>
> Agreed, though we'll have to check the current fastq parser to see if
> that's currently the case. I thought that was fixed but maybe not?
>
>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>> title on the "+" line? Biopython omits this as I understand this to be
>> common practice, and it can make a big difference to file sizes, especially
>> on short read data from Solexa/Illumina.
>
> Agreed, particularly if it's commonly encountered.
>
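To make point (4) concrete, a writer that omits the repeated title by default
could be sketched like this in Python. The function name and the repeat_title
flag are hypothetical and not taken from either library.

```python
def format_fastq_record(identifier, sequence, quality, repeat_title=False):
    """Format one FASTQ record as a four-line string.

    By default the "+" line is left bare; pass repeat_title=True to repeat
    the identifier there.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    plus_line = "+" + identifier if repeat_title else "+"
    return f"@{identifier}\n{sequence}\n{plus_line}\n{quality}\n"
```

For a record with a long title and a 36-base read, the bare "+" line saves one
full copy of the title per record.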
>> (5) Also test reading and writing files with an optional description (as
>> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA for
>> examples, e.g.
>>
>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>
> Should be easy enough to implement with a simple regex.
>
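One possible shape for that regex, sketched in Python. The pattern and the
function name parse_title are illustrative assumptions; the sketch treats the
first whitespace-delimited token as the identifier and the rest of the line as
the description.

```python
import re

# Marker character, then an identifier (possibly empty, as on a bare "+"
# line), then an optional whitespace-separated description.
TITLE_RE = re.compile(r"^[@+](\S*)(?:\s+(.*))?$")


def parse_title(line):
    """Return (identifier, description) from an "@" or "+" title line."""
    match = TITLE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a FASTQ title line: {line!r}")
    return match.group(1), match.group(2) or ""
```

Applied to the SRA example above, this yields the identifier "SRR001666.1" and
the description "071112_SLXA-EAS1_s_7:5:1:817:345 length=36".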
>> (6) Test reading and writing files where the encoded quality string starts
>> with a "@" or a "+" character, e.g.
>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>>
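The case in (6) is what breaks parsers that look for a leading "@" to find the
next record. A minimal Python sketch that avoids the problem by reading by
position instead is shown below. It assumes unwrapped four-line records (one
line each for title, sequence, "+", and quality), which is a simplification;
the function name is hypothetical.

```python
def read_fastq_records(lines):
    """Yield (title, sequence, quality) from unwrapped four-line records.

    Lines are consumed strictly by position, so a quality string that
    begins with "@" or "+" is never mistaken for a title or separator.
    """
    it = iter(lines)
    for title in it:
        title = title.rstrip("\r\n")
        if not title:
            continue  # tolerate blank lines between records
        if not title.startswith("@"):
            raise ValueError(f"expected '@' title line, got {title!r}")
        try:
            sequence = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' line, got {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield title[1:], sequence, quality
```

Handling line-wrapped FASTQ would instead require reading quality lines until
their total length matches the sequence length, which this sketch does not
attempt.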
>> Peter
>
> Mark, getting all that? ;>
>
> chris
>