[Bioperl-l] Next-gen modules

Wed Jun 17 09:25:59 EDT 2009

On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> Elia,
>
> As Mark indicated, we recently discussed the lack of support for next-gen on
> list, at least re: fastq.  I may be hit with the same thing in a few months
> time myself, and I recall Jason and a few others also mentioning the same.
>  Heikki wrote some code for Illumina FASTQ for SeqIO and related modules but
> I don't believe it has been committed to trunk yet, so maybe he can answer.
>
> From prior discussions IIRC the issues were:
>
> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0, Illumina
> 1.3) from one another (so maybe some optional validation), and

Following the Python rule of thumb that explicit is better than implicit,
Biopython makes the user specify which FASTQ variant is being used. I
don't think you can do anything else. Any attempted validation would have
to be a heuristic based on the ASCII characters found (the ranges used by
the three variants overlap), and would risk false positive warnings.
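
To illustrate why such validation can only be heuristic, here is a minimal
Python sketch. It is not Biopython or BioPerl code: the function name, the
variant labels and the upper ASCII bound of 126 are assumptions made for
illustration. It reports every variant whose ASCII range is consistent with
the quality strings seen so far.

```python
def possible_fastq_variants(quality_lines):
    """Return the set of variant labels consistent with the observed ASCII range.

    Hypothetical helper for illustration only.
    """
    # Lowest and highest ASCII code each variant is assumed to use.
    # Sanger: PHRED 0 at offset 33; Solexa: score -5 at offset 64 (ASCII 59);
    # Illumina 1.3: PHRED 0 at offset 64. The upper bound 126 is an assumption.
    ranges = {
        "sanger": (33, 126),
        "solexa": (59, 126),
        "illumina-1.3": (64, 126),
    }
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return set(ranges)  # no data, so nothing can be ruled out
    lowest, highest = min(codes), max(codes)
    return {
        name
        for name, (low, high) in ranges.items()
        if low <= lowest and highest <= high
    }
```

A string containing "9" (ASCII 57) can only be Sanger, but a string made up
of high-quality characters such as "I" and "h" fits all three variants, so
the data alone cannot settle the question.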

> 2) having a way for the Seq object to either 'know' what format is
> contained, or we use phred score and convert back and forth from that (I
> think the latter makes more sense).

I think it could make sense for BioPerl to convert Solexa scores to/from
PHRED scores on the fly (especially now that Illumina is abandoning
the Solexa score system). Python style tries to avoid implicit conversions,
so Biopython doesn't automatically do a conversion from Solexa to
PHRED scores on parsing (but will on writing if the requested output
format requires this).
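
For reference, the conversion itself is short. The sketch below assumes the
usual definitions, PHRED Q = -10 log10(p) and Solexa Q = -10 log10(p/(1-p)),
where p is the error probability. The function names are made up for this
example, and clamping at -5 (the lowest Solexa score normally written) is an
assumed convention, not something fixed by the formulas.

```python
import math


def phred_from_solexa(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (as a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_from_phred(q_phred):
    """Convert a PHRED score to the equivalent Solexa score (as a float).

    PHRED 0 means p = 1, for which the Solexa score is minus infinity,
    so the result is clamped at -5 (an assumed convention).
    """
    if q_phred < 0:
        raise ValueError("PHRED scores cannot be negative")
    if q_phred == 0:
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales agree closely for high-quality bases and diverge only at the
low end; for example, Solexa 0 corresponds to a PHRED score of about 3.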

> Peter's suggestions also are reasonable, though does biopython have a
> separate module for each of these variations?  Our version (I believe)
> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
> fastq variant passed in as a separate named argument.

Biopython's SeqIO gives the three FASTQ variants their own unique
names. This format name is a required argument for parsing/writing
(we don't try and guess the file format from the data contents). Internally
we have three separate FASTQ parsers/writers although they do share
code.
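
As a rough sketch of that design (not Biopython's actual implementation),
the variant can be a required argument that selects an offset and score
type, with the decoding code shared. The labels below follow Biopython's
format names as I recall them; the table and function name are hypothetical.

```python
# Variant label -> (ASCII offset, score type). Illustrative table only.
VARIANTS = {
    "fastq-sanger": (33, "phred"),
    "fastq-solexa": (64, "solexa"),
    "fastq-illumina": (64, "phred"),
}


def decode_quality(quality_string, variant):
    """Decode a quality string using an explicitly named FASTQ variant.

    There is deliberately no default and no guessing: an unknown variant
    name is an error.
    """
    try:
        offset, score_type = VARIANTS[variant]
    except KeyError:
        raise ValueError(f"Unknown FASTQ variant: {variant!r}") from None
    return score_type, [ord(ch) - offset for ch in quality_string]
```

For example, decode_quality("II9", "fastq-sanger") gives PHRED scores 40,
40 and 24, whereas the same string under "fastq-illumina" would give
nonsensical negative PHRED values, which is why the caller has to say which
variant is meant.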

Other issues to keep in mind:

(3) There should be no warning parsing files where the optional repeated
title is missing on the "+" lines (as discussed earlier on the BioPerl list).

(4) When writing FASTQ files, should BioPerl omit the optional repeated
title on the "+" line? Biopython omits this, as I understand it to be
common practice, and it can make a big difference to file sizes - especially
on short read data from Solexa/Illumina.

(5) Also test reading and writing files with an optional description (as well
as an identifier) on the "@" (and "+") lines. See the NCBI SRA for examples,
e.g.

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

(6) Test reading and writing files where the encoded quality string starts
with a "@" or a "+" character, e.g.
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
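
Points (3) to (6) can be exercised with a very small reader and writer. The
Python sketch below is only an illustration under one big assumption: each
record is exactly four lines, with no line wrapping of the sequence or
quality. The function names are hypothetical. Because lines are interpreted
by position rather than by their first character, a quality string starting
with "@" or "+" causes no trouble.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Assumes unwrapped, four-line records. The title keeps any description.
    """
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("Record should start with '@'")
        title = line[1:]
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("Truncated record for " + title)
        if not plus.startswith("+"):
            raise ValueError("Missing '+' line for " + title)
        # The repeated title is optional: accept a bare "+" silently,
        # but complain if a title is present and does not match.
        if plus[1:] and plus[1:] != title:
            raise ValueError("Mismatched '+' title for " + title)
        if len(qual) != len(seq):
            raise ValueError("Sequence/quality length mismatch for " + title)
        yield title, seq, qual


def format_fastq(title, seq, qual, repeat_title=False):
    """Return one record as text, omitting the repeated title by default."""
    plus = "+" + title if repeat_title else "+"
    return f"@{title}\n{seq}\n{plus}\n{qual}\n"
```

A round trip through these two functions keeps the description on the "@"
line and drops the repeated title unless repeat_title=True is passed.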

Peter