[emboss-dev] EMBOSS seqret FASTQ support
Peter
biopython at maubp.freeserve.co.uk
Mon Jul 20 18:12:29 EDT 2009
Earlier I wrote:
> Hi all,
>
> I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0
> ...
> So far so good :)
Could anyone spot a "but" coming up?
Well, here we are - consider the following single Sanger format
FASTQ record (originally from the NCBI SRA, I think SRA000271,
but I would have to double check that).
@071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
+071113_EAS56_0053:1:1:182:712
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
I would guess the problem is that the quality line starts with an @,
so a parser cannot assume that every line starting with @ begins a
new record. Likewise of course, quality lines can start with a +
character too (although in my quick testing EMBOSS seems happy with
these).
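To illustrate the kind of care I mean, here is a minimal sketch of a parser that uses the sequence length, rather than the leading character, to decide where the quality ends. The function name iter_fastq is made up for this example; it is not the EMBOSS or Biopython code, just one way the idea could look in plain Python.

```python
def iter_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Hypothetical helper: quality lines are consumed until they are as long
    as the sequence, so a leading '@' or '+' in the quality is harmless.
    """
    it = iter(lines)
    line = next(it, None)
    while line is not None:
        line = line.rstrip("\r\n")
        if not line:
            # Skip blank lines between records.
            line = next(it, None)
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' at the start of a record")
        title = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("Missing '+' line for record %r" % title)
        seq = "".join(seq_parts)

        # Quality lines run until we have as many characters as the sequence.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("Truncated quality for record %r" % title)
            line = line.rstrip("\r\n")
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(seq):
            raise ValueError("Quality longer than sequence for record %r" % title)

        yield title, seq, "".join(qual_parts)
        line = next(it, None)
```

Note this sketch assumes sequence lines never start with a +, since it takes the first such line as the separator.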
The ASCII code for @ is 64, meaning for a Sanger style file this
is a PHRED quality of 64-33 = 31. Here is what Biopython gives
for the FASTA conversion:
>071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
And this is what Biopython gives for the QUAL conversion,
showing the PHRED scores as integers:
>071113_EAS56_0053:1:1:182:712
31 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 35 40 40
40 40 40 27 4 27 21 5 12 9 8 13 7 9 4 10
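The arithmetic behind those integers is easy to check by hand. A minimal sketch in plain Python, assuming the Sanger ASCII offset of 33 (the function name sanger_phred is my own, not a Biopython call):

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores PHRED scores as ASCII code minus 33


def sanger_phred(quality):
    """Convert a Sanger FASTQ quality string into a list of PHRED scores."""
    return [ord(char) - SANGER_OFFSET for char in quality]


quality = "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+"
scores = sanger_phred(quality)
print(scores[0])  # 31, since ord("@") is 64
print(" ".join(str(score) for score in scores))
```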
Anyway, EMBOSS doesn't seem to like this example FASTQ record:
$ seqret -sequence tricky_one.fastq -sformat fastq -osformat fasta -filter
Error: Unable to read sequence 'tricky_one.fastq'
Died: seqret terminated: Bad value for '-sequence' with -auto defined
This read is actually one of four records in the following Biopython
test file, in which EMBOSS only seems to find the first record:
http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq
As described here, this is a hand-modified version of a real NCBI
FASTQ file to showcase several potential gotchas in parsing FASTQ
(including some unlikely to occur in real life - unless someone were
to concatenate FASTQ files from separate sources or something):
http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html#FastqGeneralIterator
In fact, looking at that again now, maybe I should include another
record where the sequence line starts with a "+" as well... maybe
even a record with the quality split over multiple lines, some starting
with @ and some with +. That would be an even better evil test ;)
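As a rough idea of how such an evil record could be generated, here is a sketch that line-wraps the record shown earlier. The helper names wrap and wrapped_fastq are invented for this example, and whether any given parser accepts the output is a separate question. Because this particular quality string ends with a +, wrapping one character short of the full length gives one quality line starting with @ and another starting with +.

```python
def wrap(text, width):
    """Split text into chunks of at most width characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def wrapped_fastq(title, seq, qual, width):
    """Return a FASTQ record as text with sequence and quality line-wrapped."""
    if len(seq) != len(qual):
        raise ValueError("Sequence and quality lengths differ")
    lines = ["@" + title] + wrap(seq, width) + ["+"] + wrap(qual, width)
    return "\n".join(lines) + "\n"


title = "071113_EAS56_0053:1:1:182:712"
seq = "ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG"
qual = "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+"

# Wrap one short of the full length so the final quality line is just "+".
evil = wrapped_fastq(title, seq, qual, len(qual) - 1)
print(evil, end="")
```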
Regards,
Peter C.