[Bioperl-l] Next-gen modules
Mark A. Jensen
maj at fortinbras.us
Wed Jun 17 17:13:23 UTC 2009
I'm on the case! (but maybe not in realtime, today!)
----- Original Message -----
From: "Chris Fields" <cjfields at illinois.edu>
To: "Peter" <biopython at maubp.freeserve.co.uk>
Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka"
<e.stupka at ucl.ac.uk>; "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
Sent: Wednesday, June 17, 2009 1:06 PM
Subject: Re: [Bioperl-l] Next-gen modules
>
> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>
>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields<cjfields at illinois.edu> wrote:
>>>
>>> Elia,
>>>
>>> As Mark indicated, we recently discussed the lack of support for next-gen
>>> on
>>> list, at least re: fastq. I may be hit with the same thing in a few months
>>> time myself, and I recall Jason and a few others also mentioning the same.
>>> Heikki wrote some code for Illumina FASTQ for SeqIO and related modules
>>> but
>>> I don't believe it has been committed to trunk yet, so maybe he can answer.
>>>
>>> From prior discussions IIRC the issues were:
>>>
>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>> Illumina
>>> 1.3) from one another (so maybe some optional validation), and
>>
>> Following the python rule of thumb for being explicit, Biopython makes
>> the user specify which FASTQ variant is being used. I don't think you
>> can do anything else. Any attempted validation would have to be
>> heuristic based on the ASCII characters found, and would risk false
>> positive warnings.
>
> Right; I'm thinking along the same lines. If anything the most we would
> allow is some level of validation, so if there were a degree of uncertainty
> about the format one could set a validation flag to check bounds during the
> parse and warn if they are exceeded.
>
>>> 2) having a way for the Seq object to either 'know' what format is
>>> contained, or we use phred score and convert back and forth from that (I
>>> think the latter makes more sense).
>>
>> I think it could make sense for BioPerl to convert Solexa scores to/ from
>> PHRED scores on the fly (especially now that Illumina is abandoning
>> the Solexa score system). Python style tries to avoid implicit conversions,
>> so Biopython doesn't automatically do a conversion from Solexa to
>> PHRED scores on parsing (but will on writing if the requested output
>> format requires this).
>>
>>> Peter's suggestions also are reasonable, though does biopython have a
>>> separate module for each of these variations? Our version (I believe)
>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
>>> fastq variant passed in as a separate named argument.
>>
>> Biopython's SeqIO gives the three FASTQ variants their own unique
>> names. This format name is a required argument for parsing/writing
>> (we don't try and guess the file format from the data contents). Internally
>> we have three separate FASTQ parsers/writers although they do share
>> code.
>
> We could easily do the same if others agree. Actually, if we specified that
> shorthand for a variant on a format would be designated as -format =>
> 'format-variant', I think we could easily hack SeqIO to deal with that by
> splitting on '-' and passing everything to the constructor as (-format =>
> 'format', -variant => 'variant'). Very little repeated code in this case,
> just an additional named parameter indicating the format variant (and the
> SeqIO class can do the type checking on that within the constructor).
>
>> Other issues to keep in mind:
>>
>> (3) There should be no warning parsing files where the optional repeated
>> title is missing on the "+" lines (as discussed earlier on the BioPerl
>> list).
>
> Agreed, though we'll have to check the current fastq parser to see if that's
> currently the case. I thought that was fixed but maybe not?
>
>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>> title on the "+" line? Biopython omits this as I understand this to be
>> common practice, and can make a big different to file sizes - especially
>> on short read data from Solexa/Illumina.
>
> Agreed, particularly if it's commonly encountered.
>
>> (5) Also test reading and writing files with an optional description (as
>> well
>> as an identifier) on the "@" (and "+") lines. See the NCBI SRA for examples,
>> e.g.
>>
>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>
> Should be easy enough to implement with a simple regex.
>
>> (6) Test reading and writing files where the encoded quality string starts
>> with a "@" or a "+" character, e.g.
>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>>
>> Peter
>
> Mark, getting all that? ;>
>
> chris
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list