[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Wed Jun 17 17:52:49 UTC 2009
I think this is a top priority for a fall BioPerl release, maybe 1.6.2
(I am planning on a summer 1.6.1 release still). Made it into a bug
report for tracking:
http://bugzilla.open-bio.org/show_bug.cgi?id=2857
If no one works on this I may take it up after the 1.6.1 release.
chris
On Jun 17, 2009, at 12:13 PM, Mark A. Jensen wrote:
> I'm on the case! (but maybe not in realtime, today!)
>
> ----- Original Message -----
> From: "Chris Fields" <cjfields at illinois.edu>
> To: "Peter" <biopython at maubp.freeserve.co.uk>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka" <e.stupka at ucl.ac.uk>; "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
> Sent: Wednesday, June 17, 2009 1:06 PM
> Subject: Re: [Bioperl-l] Next-gen modules
>
>
>>
>> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>>
>>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>>
>>>> Elia,
>>>>
>>>> As Mark indicated, we recently discussed the lack of support for
>>>> next-gen on list, at least re: fastq. I may be hit with the same
>>>> thing in a few months time myself, and I recall Jason and a few
>>>> others also mentioning the same. Heikki wrote some code for
>>>> Illumina FASTQ for SeqIO and related modules but I don't believe it
>>>> has been committed to trunk yet, so maybe he can answer.
>>>>
>>>> From prior discussions IIRC the issues were:
>>>>
>>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>>> Illumina 1.3) from one another (so maybe some optional validation),
>>>> and
>>>
>>> Following the Python rule of thumb for being explicit, Biopython
>>> makes the user specify which FASTQ variant is being used. I don't
>>> think you can do anything else. Any attempted validation would have
>>> to be heuristic based on the ASCII characters found, and would risk
>>> false positive warnings.
>>
>> Right; I'm thinking along the same lines. If anything the most we
>> would allow is some level of validation, so if there were a degree
>> of uncertainty about the format one could set a validation flag to
>> check bounds during the parse and warn if they are exceeded.
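A minimal sketch of the bounds check described above, written in Python for brevity. The table and function names are hypothetical, and the score ranges are the commonly quoted ones for each variant (Sanger: PHRED 0 to 93 at ASCII offset 33; Solexa/Illumina 1.0: Solexa -5 to 62 at offset 64; Illumina 1.3: PHRED 0 to 62 at offset 64), not values taken from this thread.

```python
# Hypothetical lookup table: variant -> (ASCII offset, min score, max score).
# These ranges are assumptions based on the commonly documented encodings.
FASTQ_VARIANTS = {
    "sanger": (33, 0, 93),
    "solexa": (64, -5, 62),
    "illumina": (64, 0, 62),
}


def out_of_bounds(quality, variant):
    """Return (position, character) pairs whose decoded score falls
    outside the range expected for the named variant."""
    offset, lowest, highest = FASTQ_VARIANTS[variant]
    return [
        (pos, char)
        for pos, char in enumerate(quality)
        if not lowest <= ord(char) - offset <= highest
    ]


# A parser with a validation flag set could warn when this list is non-empty.
problems = out_of_bounds("IIII9", "illumina")
```

A check like this can only flag a string as inconsistent with the declared variant; a string that passes is not proof of the variant, since the ranges overlap.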
>>
>>>> 2) having a way for the Seq object to either 'know' what format is
>>>> contained, or we use phred score and convert back and forth from
>>>> that (I think the latter makes more sense).
>>>
>>> I think it could make sense for BioPerl to convert Solexa scores
>>> to/from PHRED scores on the fly (especially now that Illumina is
>>> abandoning the Solexa score system). Python style tries to avoid
>>> implicit conversions, so Biopython doesn't automatically do a
>>> conversion from Solexa to PHRED scores on parsing (but will on
>>> writing if the requested output format requires this).
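For reference, the Solexa/PHRED conversion mentioned above can be sketched as follows. The formulas are the standard ones relating the two scales; the function names are hypothetical, and flooring the result at -5 (the lowest Solexa score) for a PHRED score of 0 is an assumption made here to keep the function defined everywhere.

```python
import math


def solexa_to_phred(q_solexa):
    """Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Q_solexa = 10 * log10(10^(Q_phred / 10) - 1), floored at -5.

    The formula is undefined at Q_phred == 0, so that case (and anything
    that would map below -5) is clamped to -5. The clamp is an assumption.
    """
    if q_phred < 0:
        raise ValueError("PHRED scores cannot be negative")
    if q_phred == 0:
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales agree closely at high quality and diverge only at the low end, which is where a round trip through integer scores can lose information.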
>>>
>>>> Peter's suggestions also are reasonable, though does biopython have
>>>> a separate module for each of these variations? Our version (I
>>>> believe) mainly varied the conversion within Bio::SeqIO::fastq
>>>> itself based on the fastq variant passed in as a separate named
>>>> argument.
>>>
>>> Biopython's SeqIO gives the three FASTQ variants their own unique
>>> names. This format name is a required argument for parsing/writing
>>> (we don't try and guess the file format from the data contents).
>>> Internally we have three separate FASTQ parsers/writers although
>>> they do share code.
>>
>> We could easily do the same if others agree. Actually, if we
>> specified that shorthand for a variant on a format would be
>> designated as -format => 'format-variant', I think we could easily
>> hack SeqIO to deal with that by splitting on '-' and passing
>> everything to the constructor as (-format => 'format', -variant =>
>> 'variant'). Very little repeated code in this case, just an
>> additional named parameter indicating the format variant (and the
>> SeqIO class can do the type checking on that within the
>> constructor).
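The 'format-variant' shorthand proposed above can be sketched in a few lines. This is shown in Python only as an illustration of the splitting logic, not as the SeqIO implementation; the function name is hypothetical.

```python
def split_format(spec):
    """Split 'format-variant' into (format, variant).

    Only the first '-' is significant; a spec without one yields
    a variant of None. Both parts are lower-cased.
    """
    fmt, _, variant = spec.partition("-")
    return fmt.lower(), (variant.lower() or None)


# The two values would then be handed on as separate named parameters,
# along the lines of (-format => 'fastq', -variant => 'illumina').
fmt, variant = split_format("fastq-illumina")
```

Checking the variant against the set a given format supports would still belong in that format's own constructor, as suggested above.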
>>
>>> Other issues to keep in mind:
>>>
>>> (3) There should be no warning parsing files where the optional
>>> repeated title is missing on the "+" lines (as discussed earlier on
>>> the BioPerl list).
>>
>> Agreed, though we'll have to check the current fastq parser to see
>> if that's currently the case. I thought that was fixed but maybe
>> not?
>>
>>> (4) When writing FASTQ files should BioPerl omit the optional
>>> repeated title on the "+" line? Biopython omits this as I understand
>>> this to be common practice, and can make a big difference to file
>>> sizes - especially on short read data from Solexa/Illumina.
>>
>> Agreed, particularly if it's commonly encountered.
>>
>>> (5) Also test reading and writing files with an optional description
>>> (as well as an identifier) on the "@" (and "+") lines. See the NCBI
>>> SRA for examples, e.g.
>>>
>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>
>> Should be easy enough to implement with a simple regex.
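One possible shape for such a regex, sketched in Python: the identifier is taken to be everything up to the first whitespace and the description is whatever follows. The pattern and function name are hypothetical, and treating a bare "+" line as an empty identifier is an assumption that also covers point (3).

```python
import re

# Leading '@' or '+', then an identifier (possibly empty), then an
# optional whitespace-separated description.
TITLE_LINE = re.compile(r"^[@+](\S*)(?:\s+(.*))?$")


def parse_title(line):
    """Return (identifier, description) from an '@' or '+' line."""
    match = TITLE_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a FASTQ title line: %r" % line)
    return match.group(1), match.group(2) or ""


ident, desc = parse_title(
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
)
```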
>>
>>> (6) Test reading and writing files where the encoded quality string
>>> starts with a "@" or a "+" character, e.g.
>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
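One way to handle point (6) is to stop treating a leading "@" as a record marker while reading quality lines, and instead read quality text until its length matches the sequence length. The sketch below, in Python, assumes that approach; it is an illustration with a hypothetical function name, not the existing BioPerl or Biopython parser.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Quality lines are consumed by length, so a quality string that
    begins with '@' or '+' is never mistaken for a title line.
    """
    it = iter(lines)
    for title in it:
        title = title.rstrip("\n")
        if not title:
            continue  # skip blank lines between records
        if not title.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % title)
        seq_parts = []
        for line in it:
            line = line.rstrip("\n")
            if line.startswith("+"):
                break  # the repeated title, if any, is ignored
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' line for %r" % title)
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(it).rstrip("\n")
            except StopIteration:
                raise ValueError("truncated quality for %r" % title) from None
        if len(qual) != len(seq):
            raise ValueError("quality/sequence length mismatch for %r" % title)
        yield title[1:], seq, qual
```

Because the "+" line is only used as a separator here, files with and without the repeated title parse identically.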
>>>
>>> Peter
>>
>> Mark, getting all that? ;>
>>
>> chris
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>