[Biopython-dev] Line wrapping in FASTQ output
Peter
biopython at maubp.freeserve.co.uk
Wed Jul 22 11:56:23 UTC 2009
Hi Peter R. et al,
Up until now I had mostly been trying EMBOSS 6.1.0 with short read
data. I've just noticed that for longer reads EMBOSS wraps the sequence
and quality lines in FASTQ output (at 60 characters). There is an
example of this at the end of the email.
My understanding is that while line breaks are allowed in the
sequence and quality lines of a FASTQ file, they are discouraged as
they can break simple-minded parsers. Unfortunately right now I can't
find any references/websites to back up this assertion (other than
things I wrote myself since), but I was sure I read this on the MAQ
site somewhere. Several sites do simply talk about "the" sequence line
and "the" quality line (indeed the early drafts of the wikipedia page
had this assumption, which I fixed). This is natural if all you have
ever worked with is short read data. Of course, 454 reads are hundreds
of bases long, and even the latest Illumina reads are now in the range
70 to 100 bp (or so I hear), so this issue will become more common.
Any existing parsers that can't cope with line breaks will soon get
broken, and hopefully fixed.
For Biopython we should be able to cope with any strange line breaks in
the sequence and quality lines on input, but for output we don't do
any line wrapping. I felt this would result in more widely parseable
output. I wondered what your thought process was, and if you think it
is worth removing the line wrapping on EMBOSS's FASTQ output (or
indeed, if you have a good argument to convince me to make Biopython
output FASTQ with line wrapping by default).
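To make the "tolerant on input, unwrapped on output" idea concrete, here is
a minimal sketch in plain Python. It is not Biopython's actual code; the
function names parse_wrapped_fastq and write_unwrapped_fastq are made up
for illustration. The one assumption is that the quality string has the
same length as the sequence, which lets the reader keep consuming quality
lines until enough characters have been seen, even if a wrapped quality
line happens to start with "@" or "+".

```python
import io


def parse_wrapped_fastq(handle):
    """Yield (title, sequence, quality) tuples from a FASTQ handle.

    Tolerates line breaks inside the sequence and quality strings.
    Hypothetical helper for illustration, not Biopython's parser.
    """
    line = handle.readline()
    while line:
        if not line.strip():
            # Skip blank lines between records.
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' at start of record, got %r" % line)
        title = line[1:].rstrip()

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("No '+' line found for record %r" % title)
        seq = "".join(seq_parts)

        # Quality lines: read until we have as many characters as bases.
        # Counting characters (rather than looking for '@') matters because
        # '@' and '+' are themselves valid quality characters.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError("Quality shorter than sequence in %r" % title)
            part = line.rstrip("\r\n")
            qual_parts.append(part)
            qual_len += len(part)
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError("Quality longer than sequence in %r" % title)

        yield title, seq, qual
        line = handle.readline()


def write_unwrapped_fastq(records, handle):
    """Write records as four-line FASTQ: no wrapping, bare '+' line."""
    for title, seq, qual in records:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


# Usage: a 94-base record wrapped at 60 columns, written back unwrapped.
seq = "ACTG" * 23 + "AN"
qual = "".join(chr(code) for code in range(126, 32, -1))  # PHRED 93 down to 0
wrapped = "@Test\n%s\n%s\n+Test\n%s\n%s\n" % (seq[:60], seq[60:], qual[:60], qual[60:])
out = io.StringIO()
write_unwrapped_fastq(parse_wrapped_fastq(io.StringIO(wrapped)), out)
print(out.getvalue(), end="")
```

Writing exactly four lines per record keeps the output readable by parsers
that assume a single sequence line and a single quality line.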
[I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as
ideal for an OBF cross project mailing list, something we talked about
at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going
to look into this?]
Regards,
Peter C. (at Biopython)
e.g.
$ embossversion
Reports the current EMBOSS version number
6.1.0
$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
It is likely that email software will mangle the line breaks, but in
my example file sanger_93.fastq the sequence and the quality are
single line strings (of length 94).
Now let's let EMBOSS seqret read this in and write it out again:
$ seqret -filter -seq sanger_93.fastq -sformat fastq-sanger -osformat fastq-sanger
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDC
BA@?>=<;:9876543210/.-,+*)('&%$#"!
The newlines are real and not just from the email formatting - you
can check this by piping the output through hexdump. It appears EMBOSS
is using 60 character line wrapping.
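As a cross-check on the 60 character figure, the wrapping can be reproduced
in a few lines of Python. This is only a sketch of fixed-width wrapping (the
helper name wrap_fixed is made up here), not EMBOSS code; the test strings
are rebuilt from the description of sanger_93.fastq above.

```python
def wrap_fixed(text, width=60):
    """Split text into consecutive chunks of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# Rebuild the 94-character strings from the example file.
seq = "ACTG" * 23 + "AN"
qual = "".join(chr(code) for code in range(126, 32, -1))  # PHRED 93 down to 0

for name, value in (("sequence", seq), ("quality", qual)):
    chunks = wrap_fixed(value)
    print(name, [len(chunk) for chunk in chunks])

# repr() makes the embedded "\n" visible, much as hexdump would.
print(repr("\n".join(wrap_fixed(seq))))
```

Both strings split into chunks of 60 and 34 characters, and the second
quality chunk starts with "BA@?", matching the seqret output shown above.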
Peter C.