[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)

Tue Nov 8 15:38:32 UTC 2011

Dear all,

We've talked in the past about indexing sequences in gzipped files, e.g.
http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html

That discussion concluded that random access into simple GZIP files
was not practical, but BGZF (used in BAM) was worth looking into.
I wrote some proof of principle code back then:
http://lists.open-bio.org/pipermail/biopython/2010-June/006555.html
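
To illustrate why a blocked format helps, here is a minimal sketch using
only the Python standard library. It is not the proof of principle code
linked above and does not write real BGZF; it just compresses records as
independent gzip members and keeps a small index, so that one record can
be fetched by decompressing a single block. The names compress_in_blocks
and fetch are made up for this example.

```python
import gzip
import zlib


def compress_in_blocks(records, per_block=2):
    """Compress records as independent gzip members.

    Returns (blob, index) where index maps the record number to
    (block start in blob, offset within decompressed block, length).
    """
    blob = bytearray()
    index = {}
    for start in range(0, len(records), per_block):
        chunk = records[start:start + per_block]
        block_start = len(blob)
        within = 0
        for i, rec in enumerate(chunk, start):
            index[i] = (block_start, within, len(rec))
            within += len(rec)
        # Each call produces a complete, self-contained gzip member
        blob += gzip.compress(b"".join(chunk))
    return bytes(blob), index


def fetch(blob, index, i):
    """Fetch record i by decompressing only the block that holds it."""
    block_start, within, length = index[i]
    # wbits=31 means expect a gzip header and trailer; decompression
    # stops at the end of this member, the rest is left in unused_data
    decompressor = zlib.decompressobj(31)
    data = decompressor.decompress(blob[block_start:])
    return data[within:within + length]


records = [b">a\nACGT\n", b">b\nGGCC\n", b">c\nTTAA\n"]
blob, index = compress_in_blocks(records)
third = fetch(blob, index, 2)
```

The concatenated members still form a valid gzip stream, so ordinary
gzip tools can read the whole file. With a single plain gzip stream
there is no such restart point, and reaching a record near the end
means decompressing everything before it.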

I have recently polished that old code up, and done some
benchmarking (using some reasonably large FASTA, Swiss,
and UniProt-XML files). Please read this blog post:
http://blastedbio.blogspot.com/

I think random access to sequences compressed with BGZF is fast
enough to be practically useful (and the benchmarks confirm this is
not true for large gzipped files). I've also put this idea forward on
SEQanswers,
http://seqanswers.com/forums/showthread.php?t=15347

The cleaned up BGZF code is on the following branch:
https://github.com/peterjc/biopython/tree/bgzf

This adds a new module Bio.bgzf (position in namespace open to
debate) which provides read/write handles to BGZF files - trying to
follow the API used in the Python gzip library.

I then use the new BGZF reader (with its special seek/tell offsets)
from within Bio.SeqIO's index functionality. So far I've only been
testing with Bio.SeqIO.index(...), but it should work fine with
Bio.SeqIO.index_db(...) as well, although there the SQLite schema
will need a small update to record the compression type for each file.
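
For those unfamiliar with the special seek/tell offsets: BGZF uses a
64-bit "virtual offset" which packs the start of the compressed block
in the file (upper 48 bits) together with the position inside that
block once decompressed (lower 16 bits). A minimal sketch of the
arithmetic follows; the function names are illustrative and not
necessarily those used in the branch.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and a within-block offset into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block start, within-block offset)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A record starting 10 bytes into the block found at file offset 100000
voffset = make_virtual_offset(100000, 10)
block_start, within_block = split_virtual_offset(voffset)
```

Since these are ordinary integers, an index can store them just like
plain file offsets, which is why the indexing code needs few changes.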

Is anyone interested in testing this out?

Note that to produce a BGZF file you can use the tool bgzip in
samtools, or run Bio/bgzf.py directly at the command line, which will
compress stdin to stdout. Both approaches call zlib internally,
and the run time is practically identical.
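
As a rough idea of what calling zlib internally involves, here is a
sketch of writing a single BGZF block with only the Python standard
library. This is not the code from Bio/bgzf.py; the function name
bgzf_block and the 65280 byte input cap are assumptions made for this
example. It follows the block layout in the SAM/BAM specification: a
gzip member whose extra field has a "BC" subfield holding the total
block size minus one.

```python
import gzip
import struct
import zlib

MAX_INPUT = 65280  # assumed cap so the whole block stays under 64 KiB


def bgzf_block(data):
    """Return data wrapped up as one BGZF block (a gzip member)."""
    if len(data) > MAX_INPUT:
        raise ValueError("too much data for a single block")
    # Raw deflate (negative wbits), since the gzip header is built by hand
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = compressor.compress(data) + compressor.flush()
    # Header (12) + extra field (6) + cdata + trailer (8), minus one
    bsize = len(cdata) + 25
    if bsize > 0xFFFF:
        raise ValueError("compressed block too large")
    header = struct.pack(
        "<BBBBIBBH",
        31, 139,  # gzip magic bytes
        8,        # compression method: deflate
        4,        # flags: extra field present
        0,        # modification time
        0,        # extra flags
        255,      # operating system: unknown
        6,        # length of the extra field
    )
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


first = bgzf_block(b">seq1\nACGT\n")
second = bgzf_block(b">seq2\nGGCC\n")
# The blocks are ordinary gzip members, so plain gzip can read them back
roundtrip = gzip.decompress(first + second)
```

A reader can jump from block to block by reading the size stored at
byte 16 of each header, without decompressing anything in between.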

Regards,

Peter