[Biopython-dev] sff reader

Fri Aug 14 07:57:26 EDT 2009

On Thu, Aug 13, 2009 at 6:33 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>
> You'll probably be interested to know I've made some excellent progress
> with the (optional) SFF index block. I note that the specifications (both
> on the NCBI page and in the Roche manual) appear to suggest that the
> index block could appear in the middle of the read data. However,
> in all the examples I have looked at, the index is actually at the end.
>
> http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff
>
> Sadly the format of the index isn't documented, but I think I have
> reverse engineered the format that Roche SFF files are using. In
> a slight twist of the specification they are actually using the index
> block for both XML metadata AND an index of the read offsets.
>
> This will dovetail nicely with the indexing support in Bio.SeqIO
> which I am working on for Biopython 1.52, branch on github.
> I expect to have fast random access to reads in an SFF file
> very soon. See http://github.com/peterjc/biopython/tree/convert

Sorry, wrong branch - my "index" branch has the indexing (as well
as SFF files and the Bio.SeqIO.convert() functionality):

http://github.com/peterjc/biopython/tree/index

I've got this code working nicely for reading or indexing SFF files.
Testing with a 2GB SFF file containing 660808 Roche 454 reads:
using the Roche index I can index the file in under 3 seconds and
retrieve any single record almost instantly. If the index is missing (or not
in the expected format) I have to scan the file to build my own
index, and that takes about 11 seconds - which is still fine :)
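
To make the fallback scan concrete, here is a rough sketch of the idea
using only the documented parts of the format (the common header and
the per-read headers from the NCBI page). This is not the code on my
branch: the function names are made up for illustration, it assumes
flowgram format code 1 (two bytes per flow value), and it does not try
to decode the undocumented Roche index block at all. It just steps
over that block if it turns up between reads.

```python
import struct

# Layouts as documented in the SFF specification; all fields big-endian.
COMMON_HEADER = struct.Struct(">4s4BQIIHHHB")  # 31 bytes
READ_HEADER = struct.Struct(">2HI4H")          # 16 bytes


def pad8(n):
    """Round n up to a multiple of 8 (SFF sections are 8-byte aligned)."""
    return (n + 7) // 8 * 8


def read_common_header(handle):
    """Return (header_length, index_offset, index_length, n_reads, n_flows)."""
    handle.seek(0)
    fields = COMMON_HEADER.unpack(handle.read(COMMON_HEADER.size))
    magic, version = fields[0], fields[1:5]
    if magic != b".sff" or version != (0, 0, 0, 1):
        raise ValueError("not an SFF version 1 file")
    (index_offset, index_length, n_reads,
     header_length, _key_length, n_flows, _format_code) = fields[5:]
    return header_length, index_offset, index_length, n_reads, n_flows


def scan_read_offsets(handle):
    """Build {read name: file offset} by walking every read in the file."""
    (header_length, index_offset, index_length,
     n_reads, n_flows) = read_common_header(handle)
    offsets = {}
    pos = header_length  # the first read follows the padded common header
    for _ in range(n_reads):
        if index_length and pos == index_offset:
            # The index block may sit between reads; step over it.
            pos = pad8(pos + index_length)
        handle.seek(pos)
        read_header_length, name_length, n_bases = READ_HEADER.unpack(
            handle.read(READ_HEADER.size))[:3]
        name = handle.read(name_length).decode("ascii")
        offsets[name] = pos
        # Read data: 2 bytes per flow value, then flow index, base and
        # quality at 1 byte per base each, padded to 8 bytes.
        pos += read_header_length + pad8(2 * n_flows + 3 * n_bases)
    return offsets
```

Given the offsets, fetching a single read is one seek() followed by
parsing that one record, which is why lookups are almost instant once
the index exists. The header's index_offset and index_length also show
whether a Roche index is present at all (both are zero if not).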

Peter