[Biopython] Public example FASTQ files (for Tutorial examples)?

Fri Apr 1 13:57:23 UTC 2011

On Fri, Apr 1, 2011 at 10:59 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Fri, Mar 25, 2011 at 7:37 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Hi all,
>>
>> One of the volunteers proof reading the Biopython tutorial
>> noticed our links to specific example FASTQ files at the NCBI
>> SRA don't work any more. They have withdrawn them from
>> the FTP site, although you can still download the files in
>> the compressed *.sra format and in theory convert them
>> to FASTQ locally with the NCBI's toolkit (which is cross
>> platform).
>>
>> Another option is to download the FASTQ files via the
>> NCBI's web interface. Unless there is an obvious way to
>> do this with a URL that I missed initially, we have a
>> complicated situation to describe where the user can
>> choose all the reads for an experiment or just the filtered
>> set, and also choose to have them pre-trimmed or not.
>> Plus for me at least, the HTTP download wasn't as
>> robust as the FTP one was.
>
> Brad pointed out we should be able to get the same reads
> from the EBI's sequence read archive, the ENA.
>
> I'm looking at that, but the first example from the NCBI SRA
> is not a simple swap. It was a single 23MB FASTQ file, which
> I had thought was single ended Roche 454 data:
>
> ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz
> [dead link]
>
> I can find the same accession on the ENA, but it seems to
> be paired end data - and looks to have longer reads than
> the file from the NCBI (probably not quality trimmed?).
>
> http://www.ebi.ac.uk/ena/data/view/SRR014849
> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_1.fastq.gz
> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_2.fastq.gz
>
> Interestingly going back to the NCBI SRA, that also says it
> is paired end data, and looking at the data it does make
> sense. I'm pretty sure the original FASTQ file I got from
> the NCBI SRA a while ago would need parsing to spot
> and split on the Roche 454 linker sequences, in this case
> the 454flx linker:
>
> GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
>
> Curious - but it won't be a quick job to just swap the URL,
> I'll need to find another small example on the ENA instead.
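
To illustrate the linker splitting mentioned in the quoted message,
here is a minimal sketch in plain Python. It assumes the read contains
an exact, complete copy of the 454flx linker; real reads can carry
sequencing errors or partial linkers, so a practical tool would need
approximate matching. The function name split_on_linker is made up
for this example.

```python
# Roche 454flx paired end linker, as quoted in the message above.
FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def split_on_linker(seq, qual, linker=FLX_LINKER):
    """Split one read into two mates around the first exact linker match.

    Returns ((left_seq, left_qual), (right_seq, right_qual)),
    or None if the linker is not found (assumption: exact match only).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    start = seq.find(linker)
    if start == -1:
        return None
    end = start + len(linker)
    return (seq[:start], qual[:start]), (seq[end:], qual[end:])


# Toy read: 4 bases, the linker, then 5 bases.
mates = split_on_linker(
    "ACGT" + FLX_LINKER + "TTGCA",
    "I" * 4 + "#" * len(FLX_LINKER) + "5" * 5,
)
print(mates)
```

A read without the linker returns None, so it could be kept as a
single ended read or set aside, depending on what the analysis needs.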

I found an alternative single end Roche 454 example and updated
the tutorial. I've just been looking at the paired end Illumina example
of SRR001666, and confirmed that the ENA files and the old NCBI
files have the same number of reads, with the same IDs and the
same sequences.

See:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz
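
A check of this kind could be sketched as follows, using only the
Python standard library. This is an illustration and not necessarily
how the comparison was actually done. It assumes both files have
already been downloaded, that each record is exactly four lines (no
line wrapping), and that the read ID is the first word of the title
line. The names read_fastq and same_ids_and_sequences are
hypothetical. Biopython's own FASTQ parser in Bio.SeqIO could be used
in place of read_fastq.

```python
import gzip
from itertools import zip_longest


def read_fastq(path):
    """Yield (title, sequence, quality) from a plain or gzipped FASTQ file.

    Assumes four lines per record, i.e. no line wrapping.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            title = handle.readline().rstrip("\n")
            if not title:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\n")
            if not title.startswith("@") or not plus.startswith("+"):
                raise ValueError("unexpected FASTQ layout near %r" % title)
            yield title[1:], seq, qual


def same_ids_and_sequences(old_path, new_path):
    """True if both files hold the same reads, in order, ignoring qualities."""
    pairs = zip_longest(read_fastq(old_path), read_fastq(new_path))
    for old, new in pairs:
        if old is None or new is None:
            return False  # different number of reads
        old_id = old[0].split()[0]
        new_id = new[0].split()[0]
        if old_id != new_id or old[1] != new[1]:
            return False
    return True
```

Usage would be along the lines of
same_ids_and_sequences("old/SRR001666_1.fastq", "SRR001666_1.fastq.gz"),
where both paths are placeholders for local copies.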

The SRA FTP site used to have the files here:
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/

Curiously, the quality strings differ very slightly.

e.g. The old SRR001666_1.fastq file from NCBI SRA FTP site had:

@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510 length=36
GTGCCAGAAGTGGCGGCTGGAGGGGTAAAAGATCTG
+SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510 length=36
IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+!@+

The new SRR001666_1.fastq file from ENA FTP site contains:

@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510/1
GTGCCAGAAGTGGCGGCTGGAGGGGTAAAAGATCTG
+
IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+"@+

The title line from the ENA adds the Illumina /1 or /2 suffix to the
original read name (the second word). The ENA also sensibly leaves
out the redundant length=36 text and the optional repetition of the
title on the plus line, which makes the file a lot smaller.
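
To compare title lines across the two styles, one option is to reduce
both to the same pair of values first. The sketch below assumes the
only differences are the ones visible in the two records above (a
trailing length=N field in the NCBI style, and a /1 or /2 suffix in
the ENA style). The function name normalise_title is made up for this
example.

```python
import re


def normalise_title(title):
    """Reduce an NCBI or ENA style title line to (read_id, original_name)."""
    if title.startswith("@"):
        title = title[1:]
    # Drop the NCBI style trailing " length=36" field, if present.
    title = re.sub(r"\s+length=\d+$", "", title)
    read_id, _, original = title.partition(" ")
    # Drop the Illumina style /1 or /2 suffix, if present.
    original = re.sub(r"/[12]$", "", original)
    return read_id, original


ncbi = "@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510 length=36"
ena = "@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510/1"
print(normalise_title(ncbi) == normalise_title(ena))  # True
```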

What is interesting is the ! vs " switch at the third-to-last base
of this read: ASCII 33 vs 34, so PHRED 0 vs 1, since these are
Sanger FASTQ encoded.
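
The decoding can be checked with a few lines of plain Python. In the
Sanger FASTQ variant each quality character is the PHRED score plus
33, stored as ASCII. The helper name sanger_to_phred is made up for
this sketch; the two quality strings are copied from the records
above.

```python
SANGER_OFFSET = 33  # Sanger FASTQ stores PHRED + 33 as an ASCII character


def sanger_to_phred(quality):
    """Convert a Sanger FASTQ quality string to a list of PHRED scores."""
    return [ord(char) - SANGER_OFFSET for char in quality]


old_qual = "IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+!@+"  # NCBI SRA file
new_qual = 'IIIIIIIIIIIIIIII&I<(5I+I\'=\'6@=<;+"@+'  # ENA file

# Collect (zero-based position, old score, new score) where they differ.
differences = [
    (index, old, new)
    for index, (old, new) in enumerate(
        zip(sanger_to_phred(old_qual), sanger_to_phred(new_qual))
    )
    if old != new
]
print(differences)
```

For this read the list has a single entry, with the old score 0 and
the new score 1.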

If I promote any PHRED 0 to 1 before the comparison, i.e.
replace any ! with ", then the files agree. This seems harmless
given the meaning of PHRED scores 0 and 1, and is likely a
minor side effect of a read compression scheme.
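
As a rough sketch of that comparison, and of why the change is
harmless: under the standard PHRED definition, Q = -10 * log10(P),
a score of 0 means an error probability of 1 and a score of 1 means
about 0.79, so both mark the base call as essentially unusable. The
function names below are made up for this example, and the quality
strings are the two shown earlier for read SRR001666.70.

```python
def promote_phred0(quality):
    """Replace PHRED 0 ('!') with PHRED 1 ('"') in a Sanger quality string."""
    return quality.replace("!", '"')


def phred_to_error_probability(phred):
    """Standard PHRED definition: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


old_qual = "IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+!@+"  # NCBI SRA file
new_qual = 'IIIIIIIIIIIIIIII&I<(5I+I\'=\'6@=<;+"@+'  # ENA file

print(old_qual == new_qual)                  # False
print(promote_phred0(old_qual) == new_qual)  # True
print(phred_to_error_probability(0))            # 1.0
print(round(phred_to_error_probability(1), 3))  # 0.794
```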

Anyway, interesting, and it means the Tutorial examples using
SRR001666 can probably be updated just by switching the URLs.

Peter