[Biopython] random access to bgz file

Peter Cock p.j.a.cock at googlemail.com
Wed Apr 9 21:00:38 UTC 2014


On Wed, Apr 9, 2014 at 6:35 PM, tc9 <tc9 at sanger.ac.uk> wrote:
>
> Peter, thanks for the link to the HTML version of the bgzf
> documentation. Here are some additional details.
>
> I am trying to do random access on a bgzipped haplotype/HAPS file.
> Here is the file format description:
>
> https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample
>
> I compressed the haps file with bgzip:
>
> zcat file.haps.gz | bgzip > file.haps.bgz
>
> I know the byte position of each newline after decompression,
> but I need the block offsets to go from a decompressed position
> to a virtual offset.

Not necessarily - all you need is the virtual offset, which
handle.tell() on the BGZF handle would give you. How did you
get the positions in the decompressed file? Can you not repeat
that indexing, but record the virtual offsets via the BGZF
handle instead? The big advantage is that you just use the
virtual offsets without having to know how they are calculated.
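
A minimal sketch of that indexing approach, assuming one record
per line. The function names index_line_offsets and
index_bgzf_lines are made up for illustration; only bgzf.open,
readline and tell come from Biopython:

```python
import io


def index_line_offsets(handle):
    """Return the handle.tell() value at the start of every line.

    With a BGZF handle these are virtual offsets; with an ordinary
    file handle they are plain byte offsets.
    """
    offsets = []
    while True:
        position = handle.tell()  # offset before reading the line
        line = handle.readline()
        if not line:  # empty result means end of file
            break
        offsets.append(position)
    return offsets


def index_bgzf_lines(path):
    """Index a BGZF file by line (hypothetical helper, needs Biopython)."""
    from Bio import bgzf  # imported here so the rest runs without Biopython

    handle = bgzf.open(path, "rb")
    try:
        return index_line_offsets(handle)
    finally:
        handle.close()


# Quick check on an ordinary in-memory handle (plain byte offsets here):
demo = index_line_offsets(io.BytesIO(b"a\nbb\nccc\n"))
```

Each stored value can later be passed to handle.seek() on a BGZF
handle, followed by handle.readline(), to fetch that line directly.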

If you really want to map from decompressed offsets to
virtual offsets, you will need both the raw start offset of
each block and the decompressed size of each block (often
64 KB, but it can be less).
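
A rough sketch of that mapping, assuming you already have a list
of (raw start, raw length, data start, data length) tuples, one
per block, such as those yielded by bgzf.BgzfBlocks. The function
name to_virtual_offset is hypothetical. A BGZF virtual offset
packs the raw block start into the high bits and the offset within
the decompressed block into the low 16 bits:

```python
from bisect import bisect_right


def to_virtual_offset(blocks, data_offset):
    """Map a decompressed offset to a BGZF virtual offset.

    blocks: list of (raw_start, raw_length, data_start, data_length)
    tuples in file order, one per BGZF block.
    """
    data_starts = [block[2] for block in blocks]
    # Last block whose decompressed start is <= data_offset.
    i = bisect_right(data_starts, data_offset) - 1
    if i < 0:
        raise ValueError("offset is before the first block")
    raw_start, _raw_length, data_start, data_length = blocks[i]
    within_block = data_offset - data_start
    if within_block >= data_length:
        raise ValueError("offset is past the end of the data")
    # High bits: raw block start; low 16 bits: offset inside the block.
    return (raw_start << 16) | within_block


# Made-up block table for illustration (the last entry is an empty EOF block).
example_blocks = [(0, 100, 0, 65280), (100, 80, 65280, 1000), (180, 28, 66280, 0)]
example = to_virtual_offset(example_blocks, 65281)
```

Biopython's bgzf.make_virtual_offset does the same packing, so it
could replace the final shift-and-or line.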

> Trying to get the block offsets like this fails:
>
> import Bio
> handle = Bio.bgzf.open('file.haps.bgz')
> for values in Bio.bgzf.BgzfBlocks(handle):
>  print("Raw start %i, raw length %i; data start %i, data length %i" %
> values)

The BgzfBlocks function (which was originally intended for
low-level debugging) wants a raw file handle, opened in
binary mode, rather than a BGZF handle. I concede its
docstring doesn't say that (yet), but its example shows
this. Try:

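from Bio import bgzf
# BgzfBlocks needs the raw (still compressed) file, opened in binary mode
with open('file.haps.bgz', 'rb') as raw_handle:
    for values in bgzf.BgzfBlocks(raw_handle):
        print("Raw start %i, raw length %i; data start %i, data length %i" % values)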

Peter


