[Open-bio-l] [Biojava-l] .sff support

Mon Feb 22 06:35:11 EST 2010

On Mon, Feb 15, 2010 at 10:32 PM, Charles Imbusch <charles at imbusch.net> wrote:
>
> Hi all,
>
> I've been playing around with the sff file based on the file
> format definition at NCBI.
> I uploaded the output which includes the common header,
> the read header and read data section for the first read
> of that file.
>
> http://home.arcor.de/cimbusch/output.txt

Looks like you've been making excellent progress :)

Sorry for the delay in my reply, I was on leave last week (and
without internet access for most of it).

>> I'm happy to answer questions on how the file format works
>> (including the undocumented index block which I had to reverse
>> engineer).
>>
>
> Yes, I would like to know how that works.
> index_magic_number:778921588 .mft
> version:1.00
> Couldn't find anything about ".mft" version 1.

I believe ".mft" stands for "Manifest format", and Roche 454 use this
block to hold both a read index and an XML string (the manifest).
Immediately after the ".mft1.00" string are two longs which give the
lengths of the XML string and the actual index data. Then comes
the XML manifest string, followed by the actual index data (in the
same format as Roche's older ".srt" index-only block, which uses
base 256).
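
To make that layout concrete, here is a minimal Python sketch of
splitting such a block into its XML manifest and index data. It is
not the Biopython code. It assumes everything is big endian and that
each "long" is a 4-byte unsigned integer; the function name
parse_mft_header and the demo bytes are made up for illustration.
It does not decode the index entries themselves.

```python
import struct

MFT_MAGIC = 778921588  # the index_magic_number quoted above


def parse_mft_header(block):
    """Split a ".mft" index block into (xml_bytes, index_data_bytes).

    Assumed layout (big endian): 4-byte magic number, 4-byte version
    string, then two 4-byte unsigned lengths, then the XML manifest,
    then the index data.
    """
    magic, version, xml_size, data_size = struct.unpack(">I4sII", block[:16])
    if magic != MFT_MAGIC:
        raise ValueError("not a .mft index block")
    if version != b"1.00":
        raise ValueError("unsupported .mft version %r" % version)
    xml = block[16:16 + xml_size]
    data = block[16 + xml_size:16 + xml_size + data_size]
    if len(xml) != xml_size or len(data) != data_size:
        raise ValueError("index block is truncated")
    return xml, data


# Synthetic block for demonstration only (not real Roche data).
demo_xml = b"<manifest/>"
demo_data = b"\x00" * 5
demo_block = (
    struct.pack(">I4sII", MFT_MAGIC, b"1.00", len(demo_xml), len(demo_data))
    + demo_xml
    + demo_data
)
xml, data = parse_mft_header(demo_block)
```

Packing the magic number as a big-endian 4-byte integer gives the
bytes ".mft", which is why the number and the string appear together
in the output above.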

Note the Biopython SFF code has now been merged into our trunk:
http://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

> At the moment I have two classes: sffParser and sffFile
> My idea was that sffParser can hold one or multiple sff files.
> Each instance of sffFile has a hashtable with the identifiers
> as keys and the filepointers are stored as the values.

Not all SFF files will have an index, but the Roche .srt and .mft
index blocks will let you map from the ID to the offset. I take
advantage of this in Biopython for our Bio.SeqIO.index(...)
functionality, with a slower fallback of scanning the file to
build the index if the index information is missing (or in an
unsupported format). The Biopython index code then uses
a Python dictionary (hash) to hold the mapping from read
name to file offset. See also:

http://github.com/biopython/biopython/blob/master/Bio/SeqIO/_index.py
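
As a rough illustration of the idea (a dictionary from read name to
file offset, with a slower scan when no stored index is available),
here is a small Python sketch. It is not the Biopython code, and it
uses a made-up toy record layout (one byte name length, one byte
body length, name, body) rather than real SFF records. All function
names and the second read name are hypothetical.

```python
import io
import struct


def toy_record(name, body):
    """Build one record in the toy layout (not SFF)."""
    return struct.pack(">BB", len(name), len(body)) + name.encode("ascii") + body


def scan_offsets(handle):
    """Slow path: walk every record and note where each one starts."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        header = handle.read(2)
        if len(header) < 2:
            break  # end of file
        name_len, body_len = struct.unpack(">BB", header)
        name = handle.read(name_len).decode("ascii")
        handle.seek(body_len, 1)  # skip the body, we only want offsets
        index[name] = offset
    return index


def build_index(handle, stored_index=None):
    """Use a stored name-to-offset mapping if given, else scan the file."""
    if stored_index is not None:
        return dict(stored_index)
    return scan_offsets(handle)


def fetch(handle, index, name):
    """Jump straight to one record and return its body."""
    handle.seek(index[name])
    name_len, body_len = struct.unpack(">BB", handle.read(2))
    handle.seek(name_len, 1)  # skip over the name
    return handle.read(body_len)


toy_file = io.BytesIO(
    toy_record("EV5RTWS02JXUUH", b"ACGT") + toy_record("EV5RTWS02AAAAA", b"GG")
)
index = build_index(toy_file)  # no stored index, so this scans
body = fetch(toy_file, index, "EV5RTWS02AAAAA")
```

The point of the design is that only the names and offsets live in
memory; each read is pulled from disk on demand with a single seek.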

> Now I would like to find a good representation of one single "read" object,
> which shall be accessible with an identifier like EV5RTWS02JXUUH

I think this is a Java question, so not my area of expertise.

Peter