[Biopython] Reading from stdin with Bio.SeqIO

Fri Jun 5 07:21:35 EDT 2009

On Fri, Jun 5, 2009 at 11:57 AM, Giles
Weaver<giles.weaver at googlemail.com> wrote:
> Thanks Brad, Peter,
>
> I did write code almost identical to the code that Brad posted, so I was on
> the right track, but being new to Python I'm not familiar with interpreting
> the error messages. Foolishly, I'd neglected to check that fastq-solexa was
> supported in my Biopython install. Having replaced Biopython 1.49 (from the
> Ubuntu repos) with 1.50 I seem to be in business.

Its great that things are working now. Can you suggest how we
might improve the "Unknown format 'fastq-solexa'" message you
would have seen? It could be longer and suggest checking the
latest version of Biopython?

> I did have a look at the maq documentation at
> http://maq.sourceforge.net/fastq.shtml and tried the script at
> http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the
> output into bioperl I got the following errors:
>
> MSG: Seq/Qual descriptions don't match; using sequence description
> MSG: Fastq sequence/quality data length mismatch error
>
> The good news is that using Biopython instead of fq_all2std.pl I don't get
> the data length mismatch error.

Now that you mention this, I recall trying to email Heng Li about an
apparent bug in fq_all2std.pl where the FASTQ quality string had an
extra letter ("!") attached. I may not have the right email address as I
never got a reply (on this issue or regarding some missing brackets
in the formula on http://maq.sourceforge.net/fastq.shtml in perl).

> The descriptions mismatch error I'm not worried about, as it looks
> like its just bioperl complaining because the (apparently optional)
> quality description doesn't exist.

Good. On large files it really does make sense to omit this extra string,
but the FASTQ format is a little nebulous with multiple interpretations.

> There is a recent thread on the bioperl mailing lists where Heikki
> Lehvaslaiho has written a very detailed post
> (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the
> peculiarities of sanger/solexa/illumina quality encoding.

If you follow the BioPerl list, you might want to point out that PHRED
quality scores really can be very high when referring to assemblies
(e.g. output from maq), covering the range 0 to 93, as I learnt on Bug
2848. When considering actual raw reads, then the upper bound is
much lower.

See http://bugzilla.open-bio.org/show_bug.cgi?id=2848

> Evidently there are a lot of pitfalls for the unwary, and there may be issues
> with the maq implementation. If the maq script was used as a reference for
> the biopython version you may want to check that the same issues haven't
> been replicated in biopython.

The FASTQ format description on the maq pages where very useful,
and I did try testing against fq_all2std.pl before running into the above
mentioned apparent bug. I should probably try emailing Heng Li again...

Peter