[Biopython] random access to bgz file
tc9
tc9 at sanger.ac.uk
Mon Apr 14 13:29:35 EDT 2014
On 2014-04-09 22:00, Peter Cock wrote:
> On Wed, Apr 9, 2014 at 6:35 PM, tc9 <tc9 at sanger.ac.uk> wrote:
>
>> Peter, thanks for link to html version of the bgzf documentation. Here some additional details. I am trying to do random access on a bgzipped haplotype/HAPS file. Here file format description: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample [1] I compressed the haps file with bgzip: zcat file.haps.gz | bgzip > file.haps.bgz I know the byte position of each newline after decompression, but I need the block offsets to go from a decompressed position to a virtual offset.
>
> Not necessarily - all you need is the virtual offset which
> handle.tell() would give you. How did you get the positions
> in the decompressed file? Can you not repeat that indexing
> but using the virtual offsets via the BGZF handle? The
> big advantage is you just use the virtual offsets without
> having to know how they are calculated.
>
> If you really want to map from decompressed offsets to
> virtual offsets, you will need both the raw start offset of
> each block, but also the decompressed size of each
> block (often 64kb, but it can be less).
Initially I got the byte positions in the decompressed stream by reading
the entire thing once with gzip.open(). I re-read the compressed file
with BgzfReader and got the virtual offset of line number 1 million and
was able to seek that line with BgzfReader much faster than I could have
done with gzip.open(). See solution below, which I will post to a
question of mine on stackoverflow.com.
from Bio import bgzf
file='file.haps.gz'
handle = bgzf.BgzfReader(file)
for i in range(10**6):
handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()
assert line1==line2
For completeness I want to mention that one can do:
block_start_offset, within_block_offset =
bgzf.split_virtual_offset(virtual_offset)
virtual_offset = bgzf.make_virtual_offset(block_start_offset,
within_block_offset)
P.S. I was without a stable internet connection for a few days. Hence
the slow reply. Thanks for the help!
Links:
------
[1]
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
More information about the Biopython
mailing list