[Bioperl-l] Next-Gen and the next point release - updates
Peter
biopython at maubp.freeserve.co.uk
Wed Aug 26 17:16:08 EDT 2009
On Wed, Aug 26, 2009 at 9:36 PM, Chris Fields<cjfields at illinois.edu> wrote:
> All,
>
> I just pushed one very key bit for nextgen sequence analysis to svn, mainly
> parsing of all three FASTQ variants. These can be called by using:
>
> # grabs the FASTQ parser, specifies the Illumina variant
> my $in = Bio::SeqIO->new(-format => 'fastq-illumina',
> -file => 'mydata.fq');
>
> # same, explicitly specifies the Illumina variant
> my $in = Bio::SeqIO->new(-format => 'fastq',
> -variant => 'illumina',
> -file => 'mydata.fq');
>
> # simple 'fastq' format defaults to 'sanger' variant
> my $out = Bio::SeqIO->new(-format => 'fastq',
> -file => '>mydata.fq');
>
> FASTQ works for both input and output. As mentioned before, the
> next_dataset() method also exists for getting simple hashrefs, see the
> module documentation for more.
>
> This was one of the few remaining blockers for the 1.6.1 point release.
> ... If everything looks fine the final point release will follow soon after.
It is looking much better than yesterday - nice work :)
However, there are a few rough edges still.
===========================
Evil wrapping
===========================
Chris - Did you get the zip file of FASTQ examples I sent off list? One of
these was the evil_wrapping.fastq file already in Biopython CVS/git (under
a new name). This is intended as a real torture test, with line wrapped
quality strings where plenty of the lines start with "+" or "@" characters.
Bioperl doesn't like this file at all - but I have not dug into why.
===========================
Sanger To Illumina 1.3+
===========================
When mapping a Sanger FASTQ file with very high scores to Illumina,
these don't get the maximum value imposes (ASCII 126, tidle). e.g.
$ ./biopython_sanger2illumina < sanger_93.fastq
/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/QualityIO.py:676:
UserWarning: Data loss - max PHRED quality 62 in Illumina FASTQ
warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
But, with bioperl-live SVN,
$ ./bioperl_sanger2illumina < sanger_93.fastq
--------------------- WARNING ---------------------
MSG: Quality values not found for
illumina:63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93
---------------------------------------------------
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
You are using "@" (ASCI 64), which in this context means a PHRED score of zero.
===========================
Sanger To Solexa
===========================
Likewise when mapping a Sanger FASTQ file with very high scores to
Solexa FASTQ, these don't get the maximum value imposes (ASCII 126,
tidle). For example,
$ ./biopython_sanger2solexa < sanger_93.fastq
/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/QualityIO.py:764:
UserWarning: Data loss - max Solexa quality 62 in Solexa FASTQ
warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
But,
$ ./bioperl_sanger2solexa < sanger_93.fastq
--------------------- WARNING ---------------------
MSG: Quality values not found for
solexa:0,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93
---------------------------------------------------
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFEDB@><<
i.e. You've mapped the high value scores to "<", ASCII 60, thus Solexa -4
(an odd thing to happen - getting the lowest score wouldn't surprise me so
much).
Furthermore, notice that PHRED scores 0 and 1 have both been mapped
to "<", ASCII 60, thus Solexa -4, and not ";" ASCII 59 meaning Solexa -5.
===========================
Still, things are looking up :)
Peter
More information about the Bioperl-l
mailing list