[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)
p.j.a.cock at googlemail.com
Tue Nov 8 10:38:32 EST 2011
We've talked in the past about indexing sequences in gzipped files, e.g.
That discussion concluded that random access into simple GZIP files
was not practical, but BGZF (used in BAM) was worth looking into.
I wrote some proof of principle code back then:
I have recently polished that old code up, and done some
benchmarking (using some reasonably large FASTA, Swiss,
and UniProt-XML files). Please read this blog post:
I think random access to sequences compressed with BGZF is fast
enough to be useful practically (while confirming this is not true for
large gzipped files). I've also put this idea forward on SEQanswers,
The cleaned up BGZF code is on the following branch:
This adds a new module Bio.bgzf (position in namespace open to
debate) which provides read/write handles to BGZF files - trying to
follow the API used in the Python gzip library.
I then use the new BGZF reader (with its special seek/tell offsets)
from within Bio.SeqIO's index functionality. So far I have only tested
with Bio.SeqIO.index(...). It should also work with
Bio.SeqIO.index_db(...), but there the SQLite schema will need a
small update to record the compression type for each file.
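
As a rough illustration of the seek/tell offsets mentioned above: in
BGZF as described in the SAM/BAM specification, a "virtual offset" is a
64-bit integer whose upper 48 bits give the file position where a
compressed block starts, and whose lower 16 bits give the position
within that block's decompressed data. This is only a sketch of that
arithmetic, not the code on the branch, and the helper names are
chosen here for illustration.

```python
def make_virtual_offset(block_start: int, within_block: int) -> int:
    """Pack a BGZF block start and an in-block position into one integer.

    block_start  -- byte offset of the block in the compressed file (48 bits)
    within_block -- byte offset inside the decompressed block (16 bits)
    """
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset: int) -> tuple:
    """Undo make_virtual_offset, returning (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# Illustrative (made-up) index: record id -> virtual offset of the record.
index = {
    "seq1": make_virtual_offset(0, 0),
    "seq2": make_virtual_offset(0, 1500),
    "seq3": make_virtual_offset(18000, 42),
}
for name, offset in index.items():
    print(name, offset, split_virtual_offset(offset))
```

Because the block start sits in the high bits, virtual offsets sort in
the same order as the records appear in the file, so an index can store
one integer per record just as it would for an uncompressed file.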
Is anyone interested in testing this out?
Note that to produce a BGZF file, you can use the bgzip tool from
samtools, or run Bio/bgzf.py directly at the command line, which will
compress stdin to stdout. Both approaches call zlib internally,
and the run time is practically identical.
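
To make the "calls zlib internally" point concrete, here is a minimal
standard-library sketch of writing BGZF blocks, following the block
layout in the SAM/BAM specification. It is not the code from the
branch: the function names and the 65280-byte chunk size are
assumptions made for this example. Each block is an ordinary gzip
member with a "BC" extra subfield recording the block size, so the
output remains readable by Python's gzip module.

```python
import gzip
import struct
import zlib


def bgzf_block(data: bytes) -> bytes:
    """Compress one chunk of data into a single BGZF block."""
    # Raw deflate stream (negative wbits means no zlib/gzip wrapper).
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = compressor.compress(data) + compressor.flush()
    # BSIZE is the total block size minus one:
    # 12 byte header + 6 byte extra field + cdata + 8 byte trailer.
    bsize = len(cdata) + 25
    if bsize > 0xFFFF:
        raise ValueError("chunk too large for one BGZF block")
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME, XFL, OS=unknown, XLEN=6
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Extra subfield: 'B', 'C', SLEN=2, BSIZE
    extra = struct.pack("<BBHH", 66, 67, 2, bsize)
    # CRC32 and length of the uncompressed data
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + cdata + trailer


def bgzf_compress(data: bytes, chunk_size: int = 65280) -> bytes:
    """Split data into chunks, one BGZF block each, plus an empty EOF block."""
    blocks = [
        bgzf_block(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]
    blocks.append(bgzf_block(b""))  # empty block used as an end-of-file marker
    return b"".join(blocks)


# Small demonstration with made-up FASTA-like data.
example = b">seq1\nACGTACGT\n" * 10000
compressed = bgzf_compress(example)
# A BGZF file is a valid multi-member gzip file, so gzip can read it back.
assert gzip.decompress(compressed) == example
```

The price of random access is that each block is compressed
independently, so the output is typically a little larger than a
single gzip stream of the same data.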