[Biopython] Paired end SFF data

Wed Aug 19 13:26:18 UTC 2009

Hi all,

We're been talking about adding Bio.SeqIO support for the binary
Standard Flowgram Format (SFF) file format (used for Roche 454 data).
This is a public standard documented here:
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats

One of the open questions was how to deal with paired end data. The
Roche 454 website has a nice summary of how this works:
http://www.454.com/products-solutions/experimental-design-options/multi-span-paired-end-reads.asp

Basically after sample preparation, you have DNA fragments containing
the 3' end of your sequence, a known linker, and then the 5' end of
your sequence. The sequencing machine doesn't need to know what the
magic linker sequence is, and (I infer) after sample preparation,
everything proceeds as normal for single end 454 sequencing. The
upshot is the SFF file for a paired end read is exactly like any other
SFF file (apparently even for the XML meta data Roche include), just
most of the reads should have a "magic" linker sequence somewhere in
them.

I have located some publicly available Roche SFF files at the Sanger
Centre which include some paired end reads (note the rules about
publishing analysis of this data):
http://www.sanger.ac.uk/Projects/Echinococcus/
ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/

For example, the 2008_09_03.tar.gz archive contains a single 446MB
file FGGXRDY01.sff with 278801 reads.
ftp://ftp.sanger.ac.uk/pub/pathogens/Echinococcus/multilocularis/reads/454/2008_09_03.tar.gz

This is the XML meta data from FGGXRDY01.sff,

<manifest>
  <run>
    <run_type>454</run_type>
    <accession_prefix>FGGXRDY</accession_prefix>
    <run_name>R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE</run_name>
    <analysis_name>/data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis</analysis_name>
    <path>/data/2008_09_03/R_2008_09_03_10_22_15_FLX08070222_adminrig_EmuR1Ecoli7122PE/D_2008_09_03_14_23_54_FLX08070222_EmuR1Ecoli7122PE_FullAnalysis</path>
  </run>
  <qualityScoreVersion>1.1.03</qualityScoreVersion>
</manifest>

Nothing in this XML meta data says this is for paired end reads (nor
can this be specified elsewhere in the SFF file format). However of
the 278801 reads in FGGXRDY01.sff, about a third (108823 if you look
before trimming) have a perfect match to the FLX linker:
GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC.

Note that the linker sequence depends on how the sample was prepared,
and differs for different Roche protocols. e.g. According to the
wgs-assembler documentation the known Roche 454 Titanium paired end
linkers are instead:
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
See http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Formatting_Inputs

This is a good news/bad news situation for Bio.SeqIO. The bad news is
that identifying the paired end reads means knowing the linker
sequence(s) used, and finding them within each read. If the read was
sequenced perfectly, this is easy - but normally some heuristics are
needed. I see this as outside the scope of basic file parsing (i.e.
not something to go in Bio.SeqIO, but maybe in Bio.SeqUtils or
Bio.Sequencing). The good news is that Bio.SeqIO can treat paired end
SFF files just like single end reads - we don't have to worry about
complicated new Seq/SeqRecord objects to hold short reads separated by
an unknown region of estimated length.

Peter