[Bioperl-l] Next-gen modules
Chris Fields
cjfields at illinois.edu
Mon Jun 22 20:29:46 UTC 2009
On Jun 22, 2009, at 9:24 AM, Peter wrote:
> On Wed, Jun 17, 2009 at 6:06 PM, Chris Fields wrote:
>> Peter wrote:
>>> Other issues to keep in mind:
>>>
>>> (3) There should be no warning parsing files where the optional
>>> repeated title is missing on the "+" lines (as discussed earlier on
>>> the BioPerl list).
>>
>> Agreed, though we'll have to check the current fastq parser to see
>> if that's currently the case. I thought that was fixed but maybe not?
>>
>>> (4) When writing FASTQ files should BioPerl omit the optional
>>> repeated title on the "+" line? Biopython omits this as I understand
>>> this to be common practice, and it can make a big difference to file
>>> sizes - especially on short read data from Solexa/Illumina.
>>
>> Agreed, particularly if it's commonly encountered.
>>
>>> (5) Also test reading and writing files with an optional
>>> description (as well as an identifier) on the "@" (and "+") lines.
>>> See the NCBI SRA for examples, e.g.
>>>
>>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>>
>> Should be easy enough to implement with a simple regex.
>>
>>> (6) Test reading and writing files where the encoded quality
>>> string starts with a "@" or a "+" character, e.g.
>>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>>>
>>> Peter
>>
>> Mark, getting all that? ;>
>>
>> chris
>
> Another couple of points that I should have remembered earlier,
> related to converting between PHRED scores and Solexa scores.
> On the bright side, with Illumina abandoning the Solexa scores
> in pipeline 1.3+, these issues will go away with time:
>
> (7) If BioPerl will be converting Solexa scores to/from PHRED
> scores as integers automatically (as discussed earlier), make
> sure you round to the nearest whole number (don't just truncate
> with a call to int!). MAQ does this by adding 0.5 before calling
> int (while in Biopython I just use Python's round function).
That can probably be done with sprintf if needed. It avoids a call to
POSIX functions.
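
As a rough illustration of the rounding point, here is a minimal sketch in Python rather than Perl (in Perl, sprintf("%.0f", ...) would play the same role). It assumes the usual relation between the two scales, Q_phred = 10 * log10(10^(Q_solexa/10) + 1). The function name solexa_to_phred is made up for this example and is not an existing BioPerl or Biopython call.

```python
import math


def solexa_to_phred(solexa_q):
    """Convert one Solexa quality score to an integer PHRED score.

    Hypothetical helper for illustration only. Uses the standard
    relation Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1) and
    rounds to the nearest whole number instead of truncating.
    """
    phred = 10 * math.log10(10 ** (solexa_q / 10.0) + 1)
    # floor(x + 0.5) is the "add 0.5, then int" trick; floor (rather
    # than int) keeps it correct if it is ever reused for negatives.
    return int(math.floor(phred + 0.5))


if __name__ == "__main__":
    for q in (-5, 0, 1, 10, 40):
        exact = 10 * math.log10(10 ** (q / 10.0) + 1)
        print(q, round(exact, 3), int(exact), solexa_to_phred(q))
```

For a Solexa score of 1 the exact PHRED value is about 3.54, so truncating gives 3 while rounding gives 4, which is the kind of off-by-one that point (7) warns about.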
> (8) When asked to write out an old Solexa style FASTQ file,
> what will you do if given a standard Sanger FASTQ file (or a
> new Illumina 1.3+ FASTQ file) containing a base with PHRED
> quality zero? This maps to a Solexa quality of minus infinity...
> Right now the development version of Biopython will throw an
> error in this situation, but mapping to the lowest observed
> Solexa score might be reasonable.
>
> Peter
Maybe address this with a warning, followed by assigning the lowest
Solexa score?
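
Something along these lines, sketched in Python for illustration. It assumes the inverse relation Q_solexa = 10 * log10(10^(Q_phred/10) - 1), and it assumes -5 as the lowest Solexa score, which is a commonly used floor but is not something settled in this thread. The names phred_to_solexa and MIN_SOLEXA are hypothetical.

```python
import math
import warnings

# Assumption: -5 is treated as the lowest Solexa score to fall back on.
MIN_SOLEXA = -5


def phred_to_solexa(phred_q, min_solexa=MIN_SOLEXA):
    """Convert one PHRED quality score to an integer Solexa score.

    Hypothetical helper for illustration only. PHRED 0 would map to
    minus infinity, so it triggers a warning and returns min_solexa.
    """
    if phred_q < 0:
        raise ValueError("PHRED qualities cannot be negative")
    if phred_q == 0:
        warnings.warn(
            "PHRED quality 0 has no finite Solexa equivalent; "
            "using %d instead" % min_solexa
        )
        return min_solexa
    solexa = 10 * math.log10(10 ** (phred_q / 10.0) - 1)
    # Round to nearest (floor handles negative values correctly),
    # then clamp so very low scores never drop below the floor.
    return max(min_solexa, int(math.floor(solexa + 0.5)))


if __name__ == "__main__":
    for q in (0, 1, 2, 3, 10, 40):
        print(q, phred_to_solexa(q))
```

Whether to warn once per file or once per base is a separate design choice; the sketch warns on every call and leaves filtering to the caller.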
chris