[Bioperl-l] Next-gen modules

Wed Jun 17 13:06:44 EDT 2009

On Jun 17, 2009, at 8:25 AM, Peter wrote:

> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>
>> Elia,
>>
>> As Mark indicated, we recently discussed the lack of support for
>> next-gen on list, at least re: fastq.  I may be hit with the same
>> thing in a few months time myself, and I recall Jason and a few
>> others also mentioning the same.  Heikki wrote some code for
>> Illumina FASTQ for SeqIO and related modules but I don't believe it
>> has been committed to trunk yet, so maybe he can answer.
>>
>> From prior discussions IIRC the issues were:
>>
>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>> Illumina 1.3) from one another (so maybe some optional validation), and
>
> Following the python rule of thumb for being explicit, Biopython makes
> the user specify which FASTQ variant is being used. I don't think you
> can do anything else. Any attempted validation would have to be
> heuristic based on the ASCII characters found, and would risk false
> positive warnings.

Right; I'm thinking along the same lines.  If anything, the most we
would allow is optional validation: if the user is unsure which
variant a file uses, they could set a validation flag to check the
quality bounds during the parse and warn if they are exceeded.
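
A minimal sketch of that kind of opt-in bounds check, written in Python
only for illustration. The variant names, the `validate` flag and the
function name are assumptions, not BioPerl or Biopython API. The ASCII
ranges are the commonly documented ones: Sanger 33-126, Solexa/Illumina
1.0 59-126, Illumina 1.3 64-126.

```python
import warnings

# Lowest and highest ASCII code allowed in a quality string per variant.
# The variant names are illustrative labels, not an existing API.
QUALITY_BOUNDS = {
    "sanger": (33, 126),    # PHRED 0..93, offset 33
    "solexa": (59, 126),    # Solexa -5..62, offset 64
    "illumina": (64, 126),  # PHRED 0..62, offset 64
}


def check_quality_bounds(quality, variant, validate=False):
    """Return True if the quality string fits the variant's ASCII range.

    Does nothing unless validate is True, so the check stays opt-in.
    Emits a warning (rather than raising) when characters fall outside
    the range, and returns False in that case.
    """
    if not validate:
        return True
    low, high = QUALITY_BOUNDS[variant]
    bad = sorted({c for c in quality if not low <= ord(c) <= high})
    if bad:
        warnings.warn(
            "quality characters %r are outside the %s range" % (bad, variant)
        )
        return False
    return True
```

Note that a check like this can only show that a file is inconsistent
with the declared variant; a file that passes may still be a different
variant whose scores happen to fall in the same range.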

>> 2) having a way for the Seq object to either 'know' what format is
>> contained, or we use phred score and convert back and forth from
>> that (I think the latter makes more sense).
>
> I think it could make sense for BioPerl to convert Solexa scores
> to/from PHRED scores on the fly (especially now that Illumina is
> abandoning the Solexa score system). Python style tries to avoid
> implicit conversions, so Biopython doesn't automatically do a
> conversion from Solexa to PHRED scores on parsing (but will on
> writing if the requested output format requires this).
>
>> Peter's suggestions also are reasonable, though does biopython have a
>> separate module for each of these variations?  Our version (I
>> believe) mainly varied the conversion within Bio::SeqIO::fastq itself
>> based on the fastq variant passed in as a separate named argument.
>
> Biopython's SeqIO gives the three FASTQ variants their own unique
> names. This format name is a required argument for parsing/writing
> (we don't try and guess the file format from the data contents).
> Internally we have three separate FASTQ parsers/writers although
> they do share code.

We could easily do the same if others agree.  Actually, if we  
specified that shorthand for a variant on a format would be designated  
as -format => 'format-variant', I think we could easily hack SeqIO to  
deal with that by splitting on '-' and passing everything to the  
constructor as (-format => 'format', -variant => 'variant').  Very  
little repeated code in this case, just an additional named parameter  
indicating the format variant (and the SeqIO class can do the type  
checking on that within the constructor).
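
To make the splitting idea concrete, here is a rough sketch in Python
(not Perl, and not the actual SeqIO code). The function name and the
table of known variants are hypothetical; the point is only that one
'format-variant' string can be turned into a (format, variant) pair
and checked in one place.

```python
# Hypothetical table of variants accepted per base format.
KNOWN_VARIANTS = {"fastq": {"sanger", "solexa", "illumina"}}


def split_format(format_name):
    """Split 'format-variant' into (format, variant).

    A name without '-' yields (format, None). An unrecognised variant
    raises ValueError, mirroring type checking in a constructor.
    """
    base, _, variant = format_name.lower().partition("-")
    variant = variant or None
    allowed = KNOWN_VARIANTS.get(base)
    if variant is not None and (allowed is None or variant not in allowed):
        raise ValueError("unknown variant %r for format %r" % (variant, base))
    return base, variant


print(split_format("fastq-illumina"))
print(split_format("fasta"))
```

One thing the sketch glosses over: any existing format name that
already contains a '-' would need special handling before splitting.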

> Other issues to keep in mind:
>
> (3) There should be no warning parsing files where the optional
> repeated title is missing on the "+" lines (as discussed earlier on
> the BioPerl list).

Agreed, though we'll have to check whether the current fastq parser
already behaves that way.  I thought that was fixed, but maybe not?

> (4) When writing FASTQ files should BioPerl omit the optional repeated
> title on the "+" line? Biopython omits this as I understand this to be
> common practice, and can make a big difference to file sizes -
> especially on short read data from Solexa/Illumina.

Agreed, particularly if it's commonly encountered.

> (5) Also test reading and writing files with an optional description
> (as well as an identifier) on the "@" (and "+") lines. See the NCBI
> SRA for examples, e.g.
>
> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Should be easy enough to implement with a simple regex.
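
As an illustration of the regex approach (sketched in Python; the
pattern and the function name are assumptions, not the existing
parser), the header can be split into marker, identifier and optional
description like this. A bare "+" line is accepted and yields an empty
identifier, which also covers point (3).

```python
import re

# Marker ('@' or '+'), then an identifier with no whitespace, then an
# optional free-text description after the first run of whitespace.
HEADER = re.compile(r"^([@+])(\S*)(?:\s+(.*))?$")


def parse_header(line):
    """Return (marker, identifier, description) for a FASTQ title line."""
    match = HEADER.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a FASTQ title line: %r" % line)
    marker, identifier, description = match.groups()
    return marker, identifier, description or ""


print(parse_header("@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"))
print(parse_header("+"))
```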

> (6) Test reading and writing files where the encoded quality string
> starts with a "@" or a "+" character, e.g.
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>
> Peter
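
For point (6), one way to avoid being fooled by a quality line that
begins with "@" or "+" is to decide what each line is by its position
in the record rather than by its first character. The Python sketch
below assumes strict four-line records (no line-wrapped sequence or
quality), which is a simplification; the function names are made up
for the example. The writer leaves the "+" line bare, as in point (4).

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from four-line FASTQ records.

    Lines are classified by position, so a quality string starting
    with '@' or '+' is never mistaken for a title or separator.
    """
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        title = title.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % title)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % title)
        yield title[1:], seq, qual


def format_fastq(title, seq, qual):
    """Format one record, omitting the repeated title on the '+' line."""
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)


data = "@r1\nACGT\n+\n@+II\n@r2 demo\nGGCC\n+r2 demo\n+@!!\n"
for record in read_fastq(io.StringIO(data)):
    print(record)
```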

Mark, getting all that? ;>

chris