[Bioperl-l] Announcing Bio::SFF

Peter Cock p.j.a.cock at googlemail.com
Wed Dec 14 16:44:28 UTC 2011


On Wed, Dec 14, 2011 at 4:27 PM, Leon Timmermans
<l.m.timmermans at students.uu.nl> wrote:
> On Wed, Dec 14, 2011 at 5:03 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> Hi Leon,
>>
>> Have you looked at the index block at all, in order to offer random
>> access by read ID, or to access the Roche XML manifest? Please
>> ask if you need more information about this - or if you can read Python:
>> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>
> I have looked at it, but not implemented it yet. There is no standardized
> index, and the ones that are in common use either seem stupid (the Roche
> index, which is essentially just a weirdly formatted sequential list, though
> that should still be faster than a table scan) or undocumented (hash based
> index).

There are two widely used indexes, both from Roche (one with and
one without an XML manifest, magic bytes .mft and .srt). They are
both just a simple table of the read names and offsets, sorted
alphabetically. This works pretty well for rapid lookup in SFF files
(because the read count is not so high), and is pretty easy to implement.
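
As a rough illustration of why a sorted table of names and offsets gives
fast lookup, here is a minimal Python sketch using binary search. It does
not parse the real Roche index bytes. The read names, the offsets, and the
helper names build_index and lookup are all made up for the example.

```python
from bisect import bisect_left


def build_index(entries):
    """Return parallel lists of names and offsets, sorted by read name.

    `entries` is an iterable of (read_name, file_offset) pairs. Python
    sorts strings by code point, which matches alphabetical order for
    plain ASCII read names (an assumption of this sketch).
    """
    ordered = sorted(entries)
    names = [name for name, _ in ordered]
    offsets = [offset for _, offset in ordered]
    return names, offsets


def lookup(names, offsets, read_name):
    """Binary-search the sorted names; return the offset, or None if absent."""
    i = bisect_left(names, read_name)
    if i < len(names) and names[i] == read_name:
        return offsets[i]
    return None


# Hypothetical read names and byte offsets, for illustration only.
names, offsets = build_index([
    ("READ_C", 2048),
    ("READ_A", 840),
    ("READ_B", 1400),
])
```

Here lookup(names, offsets, "READ_B") needs O(log n) comparisons, rather
than the O(n) scan through every read record that you would need without
an index.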

I don't think anyone uses the hash table style indexes (.hsh), which
I assume were a proof of principle or trial in the early days of SFF.
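
A reader could tell these index styles apart by their magic bytes. The
sketch below assumes the magic is the first four bytes of the index block,
which is worth checking against real files. The function name
classify_index and the placeholder bytes after the magic are made up for
the example.

```python
# Magic bytes mentioned in this thread, mapped to a short description.
INDEX_STYLES = {
    b".mft": "Roche sorted index with XML manifest",
    b".srt": "Roche sorted index without XML manifest",
    b".hsh": "hash table index",
}


def classify_index(index_block):
    """Describe an index block from its leading magic bytes.

    Assumes (this sketch only) that the magic occupies the first four
    bytes of the block. Unrecognised or too-short input gives "unknown".
    """
    magic = bytes(index_block[:4])
    return INDEX_STYLES.get(magic, "unknown")


# Placeholder index blocks: real magic bytes followed by dummy padding.
example_mft = b".mft" + b"\x00" * 8
example_other = b"abcd" + b"\x00" * 8
```

Anything that comes back as "unknown" should probably just be skipped and
the file read sequentially, since the index structure is not fixed by the
SFF specification.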

One thing to check is what Ion Torrent's SFF files use. I would
guess they've followed Roche, but I don't know. After all, the
index structure is not defined in the SFF specification - it was
left extensible on purpose.

>> Is this building on Miguel Pignatelli's work? I don't recall seeing
>> any follow up posts from him after this one:
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-November/034239.html
>
> It isn't. I like his idea for reusing BioPython's test files though.

Yes, please do.

Peter