[Biopython-dev] Fwd: More FASTQ examples for cross project testing

Tue Aug 25 15:11:34 UTC 2009

Hi all,

This was posted to the OBF cross project mailing list, but if any of
you guys have some sample FASTQ data please consider sharing
a small sample (e.g. the first ten reads). We would need this to be
"no-strings attached" so that it could be used in any of the OBF
projects under their assorted open source licences.

In addition to the notes below, I would be interested in any
FASTQ files from your local sequencing centre, which may use
its own conventions for the record title lines (e.g. record names).

Thanks,

Peter

P.S. Rather than trying to send any attachments to the mailing
list, please email me personally.

---------- Forwarded message ----------
From: Peter <biopython at maubp.freeserve.co.uk>
Date: Tue, Aug 25, 2009 at 12:24 PM
Subject: More FASTQ examples for cross project testing
To: open-bio-l at lists.open-bio.org
Cc: Peter Rice <pmr at ebi.ac.uk>, Chris Fields <cjfields at illinois.edu>

Hi all,

I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan. I'm going to co-ordinate putting together a
set of valid FASTQ files for shared testing (to supplement the
existing set of invalid FASTQ files already done and being used in
Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).

What I have in mind is:

XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output

where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.
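
As a rough sketch of how this naming scheme could be generated by a
test harness (the function name fastq_test_filenames is hypothetical,
not part of any OBF project):

```python
# The three FASTQ variants named in this proposal.
VARIANTS = ("sanger", "solexa", "illumina")


def fastq_test_filenames(name, original_variant):
    """Return (original_filename, {variant: reference_filename}).

    name is the XXX part (e.g. "wrapped1") and original_variant is the
    YYY part, i.e. the FASTQ variant of the input file.
    """
    if original_variant not in VARIANTS:
        raise ValueError("Unknown FASTQ variant: %r" % original_variant)
    original = "%s_original_%s.fastq" % (name, original_variant)
    references = dict((v, "%s_as_%s.fastq" % (name, v)) for v in VARIANTS)
    return original, references
```

For instance, fastq_test_filenames("wrapped1", "sanger") gives
"wrapped1_original_sanger.fastq" plus the three "wrapped1_as_*.fastq"
reference names.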

For example, we might have:

wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
    perhaps repeating the title on the plus lines.
wrapped1_as_sanger.fastq - The same data but using the consensus of no
    line wrapping and omitting the repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
    (ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
    offset 64, with capping at PHRED 62 (ASCII 126).
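
To make the conversions concrete, here is a minimal sketch of mapping
a Sanger quality string to the other two encodings. The offsets and
the cap at 62 are as described above. Two things are my assumptions
rather than anything settled in this email: PHRED scores are mapped to
Solexa scores with the usual log-odds formula and rounded to the
nearest integer, and PHRED 0 is mapped to Solexa -5 (also used as the
minimum). The function names are hypothetical.

```python
import math

SANGER_OFFSET = 33    # Sanger FASTQ: PHRED scores, ASCII offset 33
SOLEXA_OFFSET = 64    # Solexa FASTQ: Solexa scores, ASCII offset 64
ILLUMINA_OFFSET = 64  # Illumina FASTQ: PHRED scores, ASCII offset 64
MAX_SCORE_AT_64 = 62  # 62 + 64 = ASCII 126, the last printable character


def phred_to_solexa(phred):
    """Map an integer PHRED score to an integer Solexa score."""
    if phred == 0:
        # ASSUMPTION: the formula is undefined at PHRED 0, so use -5.
        return -5
    solexa = 10 * math.log10(10 ** (phred / 10.0) - 1)
    # ASSUMPTION: round to nearest integer, with a floor of -5.
    return max(-5, int(round(solexa)))


def sanger_to_solexa(quality):
    """Convert a Sanger quality string to a Solexa quality string."""
    scores = (phred_to_solexa(ord(c) - SANGER_OFFSET) for c in quality)
    return "".join(chr(min(s, MAX_SCORE_AT_64) + SOLEXA_OFFSET) for s in scores)


def sanger_to_illumina(quality):
    """Convert a Sanger quality string to an Illumina quality string."""
    scores = (ord(c) - SANGER_OFFSET for c in quality)
    return "".join(chr(min(s, MAX_SCORE_AT_64) + ILLUMINA_OFFSET) for s in scores)
```

Here the Sanger characters "!", "I" and "~" are PHRED 0, 40 and 93;
the last of these is where the capping matters.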

Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.
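
A "sanger_full_range" record could be produced along these lines.
This is only a sketch: the record title and the repeating ACGT
sequence are placeholders of mine, not the contents of the planned
test file.

```python
def sanger_full_range_record(title="sanger_full_range"):
    """Return one unwrapped Sanger FASTQ record covering PHRED 0 to 93."""
    # PHRED 0..93 with ASCII offset 33 gives ASCII 33 ("!") to 126 ("~").
    quality = "".join(chr(33 + phred) for phred in range(94))
    # Placeholder sequence, one base per quality score (94 bases).
    sequence = ("ACGT" * 24)[:94]
    return "@%s\n%s\n+\n%s\n" % (title, sequence, quality)
```

Note that the quality string necessarily contains both "@" and "+",
which is one reason a full range file is a useful parser test.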

The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.
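
For illustration, going from an "odd but valid" original to the
"cleaned up" form might look like the sketch below. It assumes the
common approach of reading quality lines until their total length
matches the sequence length (a wrapped quality line may itself start
with "@" or "+"). It does not handle zero-length reads, and the
function names are hypothetical rather than any project's parser.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) from possibly wrapped FASTQ lines."""
    it = iter(line.rstrip("\r\n") for line in lines)
    line = next(it, None)
    while line is not None:
        if line == "":
            # Tolerate blank lines between records or at the end.
            line = next(it, None)
            continue
        if not line.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % line)
        title = line[1:]
        # Sequence lines run until the '+' line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("No '+' line found for record %r" % title)
        sequence = "".join(seq_parts)
        # Quality lines run until we have as many characters as bases.
        qual_parts = []
        qual_len = 0
        while qual_len < len(sequence):
            line = next(it, None)
            if line is None:
                raise ValueError("Truncated quality for record %r" % title)
            qual_parts.append(line)
            qual_len += len(line)
        if qual_len != len(sequence):
            raise ValueError("Quality too long for record %r" % title)
        yield title, sequence, "".join(qual_parts)
        line = next(it, None)


def clean_fastq(lines):
    """Return unwrapped FASTQ text with no repeated title on '+' lines."""
    return "".join("@%s\n%s\n+\n%s\n" % rec for rec in parse_fastq(lines))
```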

Ideally, asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants will
give exactly the same output as the reference files.
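
A unit test for this could compare each tool's output with the
reference text and show a diff on failure. A minimal sketch, assuming
both files have already been read into strings (the helper name is
hypothetical):

```python
import difflib


def diff_against_reference(produced, reference):
    """Return [] if the texts are identical, else unified diff lines."""
    if produced == reference:
        return []
    return list(difflib.unified_diff(
        reference.splitlines(True),  # keep line endings so they count too
        produced.splitlines(True),
        fromfile="reference",
        tofile="produced",
    ))
```

An empty list means the conversion matched the reference exactly,
including line endings.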

If anyone has any comments or suggestions please speak up (e.g. on my
suggested naming conventions).

Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.

Thanks,

Peter