[Bioperl-l] Next-Gen and the next point release - updates

Wed Aug 26 17:16:08 EDT 2009

On Wed, Aug 26, 2009 at 9:36 PM, Chris Fields<cjfields at illinois.edu> wrote:
> All,
>
> I just pushed one very key bit for nextgen sequence analysis to svn, mainly
> parsing of all three FASTQ variants.  These can be called by using:
>
>  # grabs the FASTQ parser, specifies the Illumina variant
>  my $in = Bio::SeqIO->new(-format    => 'fastq-illumina',
>                           -file      => 'mydata.fq');
>
>  # same, explicitly specifies the Illumina variant
>  my $in = Bio::SeqIO->new(-format    => 'fastq',
>                           -variant   => 'illumina',
>                           -file      => 'mydata.fq');
>
>  # simple 'fastq' format defaults to 'sanger' variant
>  my $out = Bio::SeqIO->new(-format    => 'fastq',
>                           -file      => '>mydata.fq');
>
> FASTQ works for both input and output.  As mentioned before, the
> next_dataset() method also exists for getting simple hashrefs, see the
> module documentation for more.
>
> This was one of the few remaining blockers for the 1.6.1 point release.
> ...  If everything looks fine the final point release will follow soon after.

It is looking much better than yesterday - nice work :)
However, there are a few rough edges still.

===========================
Evil wrapping
===========================
Chris - Did you get the zip file of FASTQ examples I sent off list? One of
these was the evil_wrapping.fastq file already in Biopython CVS/git (under
a new name). This is intended as a real torture test, with line wrapped
quality strings where plenty of the lines start with "+" or "@" characters.
Bioperl doesn't like this file at all - but I have not dug into why.

===========================
Sanger To Illumina 1.3+
===========================
When mapping a Sanger FASTQ file with very high scores to Illumina,
these don't get the maximum value imposes (ASCII 126, tidle). e.g.

$ ./biopython_sanger2illumina < sanger_93.fastq
/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/QualityIO.py:676:
UserWarning: Data loss - max PHRED quality 62 in Illumina FASTQ
  warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

But, with bioperl-live SVN,

$ ./bioperl_sanger2illumina < sanger_93.fastq

--------------------- WARNING ---------------------
MSG: Quality values not found for
illumina:63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93
---------------------------------------------------
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

You are using "@" (ASCI 64), which in this context means a PHRED score of zero.

===========================
Sanger To Solexa
===========================
Likewise when mapping a Sanger FASTQ file with very high scores to
Solexa FASTQ, these don't get the maximum value imposes (ASCII 126,
tidle). For example,

$ ./biopython_sanger2solexa < sanger_93.fastq
/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/QualityIO.py:764:
UserWarning: Data loss - max Solexa quality 62 in Solexa FASTQ
  warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;

But,

$ ./bioperl_sanger2solexa < sanger_93.fastq

--------------------- WARNING ---------------------
MSG: Quality values not found for
solexa:0,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93
---------------------------------------------------
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFEDB@><<

i.e. You've mapped the high value scores to "<", ASCII 60, thus Solexa -4
(an odd thing to happen - getting the lowest score wouldn't surprise me so
much).

Furthermore, notice that PHRED scores 0 and 1 have both been mapped
to "<", ASCII 60, thus Solexa -4, and not ";" ASCII 59 meaning Solexa -5.

===========================

Still, things are looking up :)

Peter