[Bioperl-l] Next-gen modules
Elia Stupka
e.stupka at ucl.ac.uk
Wed Jun 17 14:01:28 EDT 2009
If we reach a consensus on how/who/what, I will be happy to contribute
some coding time in the coming days.
Would a good starting point be to add the different
formats as named in BioPython, and test support for reading/writing
them? I could start playing with that.
regards,
Elia
On 17 Jun 2009, at 18:52, Chris Fields wrote:
> I think this is a top priority for a fall BioPerl release, maybe
> 1.6.2 (I am planning on a summer 1.6.1 release still). Made it into
> a bug report for tracking:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2857
>
> If no one works on this I may take it up after the 1.6.1 release.
>
> chris
>
> On Jun 17, 2009, at 12:13 PM, Mark A. Jensen wrote:
>
>> I'm on the case! (but maybe not in realtime, today!)
>>
>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu>
>> To: "Peter" <biopython at maubp.freeserve.co.uk>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Elia Stupka" <e.stupka at ucl.ac.uk>;
>> "Heikki Lehvaslaiho" <heikki at sanbi.ac.za>
>> Sent: Wednesday, June 17, 2009 1:06 PM
>> Subject: Re: [Bioperl-l] Next-gen modules
>>
>>
>>>
>>> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>>>
>>>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields at illinois.edu> wrote:
>>>>>
>>>>> Elia,
>>>>>
>>>>> As Mark indicated, we recently discussed the lack of support for
>>>>> next-gen on list, at least re: fastq. I may be hit with the same
>>>>> thing in a few months time myself, and I recall Jason and a few
>>>>> others also mentioning the same. Heikki wrote some code for
>>>>> Illumina FASTQ for SeqIO and related modules but I don't believe
>>>>> it has been committed to trunk yet, so maybe he can answer.
>>>>>
>>>>> From prior discussions IIRC the issues were:
>>>>>
>>>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>>>> Illumina 1.3) from one another (so maybe some optional validation), and
>>>>
>>>> Following the Python rule of thumb for being explicit, Biopython makes
>>>> the user specify which FASTQ variant is being used. I don't think you
>>>> can do anything else. Any attempted validation would have to be
>>>> heuristic based on the ASCII characters found, and would risk false
>>>> positive warnings.
>>>
>>> Right; I'm thinking along the same lines. If anything the most
>>> we would allow is some level of validation, so if there were a
>>> degree of uncertainty about the format one could set a validation
>>> flag to check bounds during the parse and warn if they are
>>> exceeded.
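
To make the "check bounds during the parse" idea concrete, here is a minimal Python sketch (an illustration only, not BioPerl or Biopython code). The function name `out_of_bounds` and the variant keys are hypothetical. The ASCII ranges are the commonly documented ones for each variant: Sanger 33 to 126, Solexa/Illumina 1.0 59 to 126, Illumina 1.3 64 to 126.

```python
# Allowed ASCII code ranges per FASTQ variant (inclusive).
# The keys are hypothetical labels chosen for this sketch.
VARIANT_ASCII_RANGE = {
    "sanger": (33, 126),    # PHRED 0..93, offset 33
    "solexa": (59, 126),    # Solexa -5..62, offset 64
    "illumina": (64, 126),  # PHRED 0..62, offset 64
}


def out_of_bounds(quality, variant):
    """Return (position, character) pairs that fall outside the
    ASCII range expected for the declared variant."""
    low, high = VARIANT_ASCII_RANGE[variant]
    return [
        (i, ch)
        for i, ch in enumerate(quality)
        if not low <= ord(ch) <= high
    ]
```

A caller with validation switched on could warn whenever the returned list is non-empty. A clean result does not prove the declared variant is right, since the ranges overlap heavily; it only catches characters that are impossible for that variant.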
>>>
>>>>> 2) having a way for the Seq object to either 'know' what format is
>>>>> contained, or we use phred score and convert back and forth from that
>>>>> (I think the latter makes more sense).
>>>>
>>>> I think it could make sense for BioPerl to convert Solexa scores to/from
>>>> PHRED scores on the fly (especially now that Illumina is abandoning
>>>> the Solexa score system). Python style tries to avoid implicit
>>>> conversions, so Biopython doesn't automatically do a conversion from
>>>> Solexa to PHRED scores on parsing (but will on writing if the requested
>>>> output format requires this).
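
For reference, the Solexa/PHRED conversion mentioned above can be sketched in Python as follows. The function names are hypothetical, and rounding to integers is deliberately left to the caller. The formulas follow from the two definitions: PHRED is -10*log10(p) and Solexa is -10*log10(p/(1-p)), where p is the error probability.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED quality score to the equivalent Solexa score.

    PHRED 0 means an error probability of 1, which has no finite
    Solexa equivalent, so this sketch rejects non-positive input
    rather than picking a floor value.
    """
    if q_phred <= 0:
        raise ValueError("PHRED score must be positive to convert")
    return 10 * math.log10(10 ** (q_phred / 10) - 1)
```

The two scales agree closely at high quality and diverge only at the low end, for example a Solexa score of 0 corresponds to a PHRED score of about 3.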
>>>>
>>>>> Peter's suggestions also are reasonable, though does Biopython have a
>>>>> separate module for each of these variations? Our version (I believe)
>>>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on
>>>>> the fastq variant passed in as a separate named argument.
>>>>
>>>> Biopython's SeqIO gives the three FASTQ variants their own unique
>>>> names. This format name is a required argument for parsing/writing
>>>> (we don't try and guess the file format from the data contents).
>>>> Internally we have three separate FASTQ parsers/writers although they
>>>> do share code.
>>>
>>> We could easily do the same if others agree. Actually, if we
>>> specified that shorthand for a variant on a format would be
>>> designated as -format => 'format-variant', I think we could
>>> easily hack SeqIO to deal with that by splitting on '-' and
>>> passing everything to the constructor as
>>> (-format => 'format', -variant => 'variant'). Very little repeated
>>> code in this case, just an additional named parameter indicating
>>> the format variant (and the SeqIO class can do the type checking
>>> on that within the constructor).
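
The 'format-variant' splitting proposed above is small enough to sketch. This is a Python illustration of the idea only, not the eventual SeqIO change; `split_format` is a hypothetical helper name.

```python
def split_format(spec):
    """Split 'format-variant' into (format, variant).

    Only the first '-' is treated as the separator; a spec with no
    '-' yields None for the variant.
    """
    fmt, sep, variant = spec.partition("-")
    return fmt, (variant if sep else None)
```

With this, `split_format("fastq-illumina")` gives `("fastq", "illumina")` and a plain `"fasta"` gives `("fasta", None)`, so existing format names without a variant would pass through unchanged.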
>>>
>>>> Other issues to keep in mind:
>>>>
>>>> (3) There should be no warning parsing files where the optional repeated
>>>> title is missing on the "+" lines (as discussed earlier on the BioPerl list).
>>>
>>> Agreed, though we'll have to check the current fastq parser to see
>>> if that's currently the case. I thought that was fixed but maybe
>>> not?
>>>
>>>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>>>> title on the "+" line? Biopython omits this as I understand this to be
>>>> common practice, and can make a big difference to file sizes -
>>>> especially on short read data from Solexa/Illumina.
>>>
>>> Agreed, particularly if it's commonly encountered.
>>>
>>>> (5) Also test reading and writing files with an optional description
>>>> (as well as an identifier) on the "@" (and "+") lines. See the NCBI SRA
>>>> for examples, e.g.
>>>>
>>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>>
>>> Should be easy enough to implement with a simple regex.
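
As a sketch of the "simple regex" approach to the title line, here is one possible pattern in Python (an illustration, not the BioPerl parser). The names `TITLE_RE` and `parse_title` are hypothetical. The pattern treats everything up to the first whitespace as the identifier and the rest as the description, and it accepts a bare "+" line.

```python
import re

# Leading '@' or '+', an identifier with no whitespace (possibly
# empty, as on a bare '+' line), then an optional description.
TITLE_RE = re.compile(r"^[@+](\S*)(?:\s+(.*))?$")


def parse_title(line):
    """Return (identifier, description) from an '@' or '+' line."""
    match = TITLE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a FASTQ title line: %r" % line)
    return match.group(1), match.group(2) or ""
```

Applied to the SRA example above, the identifier comes out as `SRR001666.1` and the description as `071112_SLXA-EAS1_s_7:5:1:817:345 length=36`.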
>>>
>>>> (6) Test reading and writing files where the encoded quality string
>>>> starts with a "@" or a "+" character, e.g.
>>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
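
Points (3) and (6) can both be handled by reading records by position rather than by scanning for marker characters. The Python sketch below assumes unwrapped four-line records (sequence and quality each on a single line), which is an assumption of this example and not something FASTQ guarantees in general. `read_fastq` is a hypothetical name.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) from four-line FASTQ records.

    Because lines are consumed by position, a quality string that
    begins with '@' or '+' is never mistaken for a title line.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    for title in stripped:
        if not title:
            continue  # tolerate blank lines between records
        seq = next(stripped, None)
        plus = next(stripped, None)
        qual = next(stripped, None)
        if qual is None:
            raise ValueError("truncated record: " + title)
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record: " + title)
        # The repeated title on the '+' line is optional; if it is
        # present it must match, if it is absent that is not an error.
        if plus != "+" and plus[1:] != title[1:]:
            raise ValueError("'+' line does not match title: " + title)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + title)
        yield title[1:], seq, qual
```

Files with wrapped sequence or quality lines would need a length-based reader instead, collecting quality lines until their total length matches the sequence length.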
>>>>
>>>> Peter
>>>
>>> Mark, getting all that? ;>
>>>
>>> chris
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>
---
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street
WC1E 6BT
London
UK
Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374
Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801