[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Wed Jun 17 13:06:44 EDT 2009
On Jun 17, 2009, at 8:25 AM, Peter wrote:
> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields<cjfields at illinois.edu>
> wrote:
>>
>> Elia,
>>
>> As Mark indicated, we recently discussed the lack of support for
>> next-gen on
>> list, at least re: fastq. I may be hit with the same thing in a
>> few months
>> time myself, and I recall Jason and a few others also mentioning
>> the same.
>> Heikki wrote some code for Illumina FASTQ for SeqIO and related
>> modules but
>> I don't believe it has been committed to trunk yet, so maybe he can
>> answer.
>>
>> From prior discussions IIRC the issues were:
>>
>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>> Illumina
>> 1.3) from one another (so maybe some optional validation), and
>
> Following the python rule of thumb for being explicit, Biopython makes
> the user specify which FASTQ variant is being used. I don't think you
> can do anything else. Any attempted validation would have to be
> heuristic based on the ASCII characters found, and would risk false
> positive warnings.
Right; I'm thinking along the same lines. If anything the most we
would allow is some level of validation, so if there were a degree of
uncertainty about the format one could set a validation flag to check
bounds during the parse and warn if they are exceeded.
>> 2) having a way for the Seq object to either 'know' what format is
>> contained, or we use phred score and convert back and forth from
>> that (I
>> think the latter makes more sense).
>
> I think it could make sense for BioPerl to convert Solexa scores to/
> from
> PHRED scores on the fly (especially now that Illumina is abandoning
> the Solexa score system). Python style tries to avoid implicit
> conversions,
> so Biopython doesn't automatically do a conversion from Solexa to
> PHRED scores on parsing (but will on writing if the requested output
> format requires this).
>
>> Peter's suggestions also are reasonable, though does biopython have a
>> separate module for each of these variations? Our version (I
>> believe)
>> mainly varied the conversion within Bio::SeqIO::fastq itself based
>> on the
>> fastq variant passed in as a separate named argument.
>
> Biopython's SeqIO gives the three FASTQ variants their own unique
> names. This format name is a required argument for parsing/writing
> (we don't try and guess the file format from the data contents).
> Internally
> we have three separate FASTQ parsers/writers although they do share
> code.
We could easily do the same if others agree. Actually, if we
specified that shorthand for a variant on a format would be designated
as -format => 'format-variant', I think we could easily hack SeqIO to
deal with that by splitting on '-' and passing everything to the
constructor as (-format => 'format', -variant => 'variant'). Very
little repeated code in this case, just an additional named parameter
indicating the format variant (and the SeqIO class can do the type
checking on that within the constructor).
> Other issues to keep in mind:
>
> (3) There should be no warning parsing files where the optional
> repeated
> title is missing on the "+" lines (as discussed earlier on the
> BioPerl list).
Agreed, though we'll have to check the current fastq parser to see if
that's currently the case. I thought that was fixed but maybe not?
> (4) When writing FASTQ files should BioPerl omit the optional repeated
> title on the "+" line? Biopython omits this as I understand this to be
> common practice, and can make a big different to file sizes -
> especially
> on short read data from Solexa/Illumina.
Agreed, particularly if it's commonly encountered.
> (5) Also test reading and writing files with an optional description
> (as well
> as an identifier) on the "@" (and "+") lines. See the NCBI SRA for
> examples,
> e.g.
>
> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Should be easy enough to implement with a simple regex.
> (6) Test reading and writing files where the encoded quality string
> starts
> with a "@" or a "+" character, e.g.
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>
> Peter
Mark, getting all that? ;>
chris
More information about the Bioperl-l
mailing list