[Biopython-dev] Indexing sequences compressed with BGZF (Blocked GNU Zip Format)
Peter Cock
p.j.a.cock at googlemail.com
Tue Nov 8 15:38:32 UTC 2011
Dear all,
We've talking in the past about indexing sequencing in gzipped files, e.g.
http://lists.open-bio.org/pipermail/biopython/2010-June/006546.html
That discussion concluded that random access into simple GZIP files
was not practical, but BGZF (used in BAM) was worth looking into.
I wrote some proof of principle code back then:
http://lists.open-bio.org/pipermail/biopython/2010-June/006555.html
I have recently polished that old code up, and done some
benchmarking (using some reasonably large FASTA, Swiss,
and UniProt-XML files). Please read this blog post:
http://blastedbio.blogspot.com/
I think random access to sequences compressed with BGZF is fast
enough to be useful practically (while confirming this is not true for
large gzipped files). I've also put this idea forward on SEQanswers,
http://seqanswers.com/forums/showthread.php?t=15347
The cleaned up BGZF code is on the following branch:
https://github.com/peterjc/biopython/tree/bgzf
This adds a new module Bio.bgzf (position in namespace open to
debate) which provides read/write handles to BGZF files - trying to
follow the API used in the Python gzip library.
I then use the new BGZF reader (with its special seek/tell offsets)
from within Bio.SeqIO's index functionality. I've been doing testing
with Bio.SeqIO.index(...) only so far, but it should work fine with
Bio.SeqIO.index_db(...) as well but here the SQLite schema will
need a small update to record the compression type for each file.
Is anyone interested in testing this out?
Note that to produce a BGZF file, you can use the tool bgzip in
samtools, or Bio/bgzf.py if run directly at the command line will
compress stdin to stdout. Both approaches call zlib internally,
and the run time is practically identical.
Regards,
Peter
More information about the Biopython-dev
mailing list