[Biopython] random access to bgz file

tc9 tc9 at sanger.ac.uk
Wed Apr 9 17:35:23 UTC 2014


 

Peter, thanks for the link to the HTML version of the bgzf documentation.
Here are some additional details.

I am trying to do random access on a bgzipped haplotype/HAPS file. Here
is the file format description:

https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample


I compressed the haps file with bgzip: 

zcat file.haps.gz | bgzip > file.haps.bgz 

I know the byte position of each newline after decompression, but I need
the block offsets to go from a decompressed position to a virtual
offset. Trying to get the block offsets like this fails: 

import Bio.bgzf
handle = Bio.bgzf.open('file.haps.bgz')
for values in Bio.bgzf.BgzfBlocks(handle):
    print("Raw start %i, raw length %i; data start %i, data length %i" % values)

I get this error message: 

    for values in Bio.bgzf.BgzfBlocks(handle):
  File "/software/team149/lib/python3.3/site-packages/Bio/bgzf.py", line 392, in BgzfBlocks
    block_length, data = _load_bgzf_block(handle)
  File "/software/team149/lib/python3.3/site-packages/Bio/bgzf.py", line 407, in _load_bgzf_block
    % (_bgzf_magic, magic, handle.tell()))
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'1:10'; handle.tell() now says 4
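
For reference, the four bytes named in that message are the BGZF magic
number, and b'1:10' looks like the start of the decompressed HAPS text.
That suggests BgzfBlocks was reading through the decompressing handle
rather than the raw file. A minimal sketch for checking the raw bytes,
using only the standard library (the function name is hypothetical, not
part of Biopython):

```python
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA


def starts_with_bgzf_magic(path):
    """Return True if the raw (still compressed) file begins with the BGZF magic."""
    # The built-in open() in binary mode reads the bytes on disk, with no
    # decompression, which is what a block-level parser needs to see.
    with open(path, "rb") as raw:
        return raw.read(4) == BGZF_MAGIC
```

If this returns True for file.haps.bgz, the file itself is probably
fine and only the choice of handle needs to change.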

How can I get the block offsets, so I can access a random byte/line of
my choice? 
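
One possible approach, sketched under assumptions: the Bio.bgzf
docstring shows BgzfBlocks being given a raw handle from the built-in
open(..., "rb"), so the block table can be collected that way and then
searched with bisect. The helper names load_block_table and
to_virtual_offset are hypothetical, and the bit packing is the standard
BGZF layout (block start shifted left 16 bits, plus the offset within
the block), which is what Bio.bgzf.make_virtual_offset computes.

```python
import bisect


def load_block_table(path):
    """Return a list of (data_start, data_length, raw_start), one per BGZF block."""
    from Bio import bgzf  # imported lazily; only this function needs Biopython

    table = []
    # BgzfBlocks parses the compressed bytes itself, so it gets a raw binary
    # handle from the built-in open(), not a handle from bgzf.open().
    with open(path, "rb") as raw:
        for raw_start, raw_length, data_start, data_length in bgzf.BgzfBlocks(raw):
            if data_length:  # skip empty blocks such as the EOF marker
                table.append((data_start, data_length, raw_start))
    return table


def to_virtual_offset(table, pos):
    """Map a decompressed byte position to a BGZF virtual offset."""
    # Find the last block whose decompressed start is <= pos. The infinities
    # make the probe tuple sort after any real row with the same data_start.
    probe = (pos, float("inf"), float("inf"))
    i = bisect.bisect_right(table, probe) - 1
    if i < 0:
        raise ValueError("position %i is before the start of the data" % pos)
    data_start, data_length, raw_start = table[i]
    within_block = pos - data_start
    if within_block >= data_length:
        raise ValueError("position %i is past the end of the data" % pos)
    # Same packing as Bio.bgzf.make_virtual_offset(raw_start, within_block).
    return (raw_start << 16) | within_block
```

With a known decompressed line start pos, usage would then be along
these lines:

    table = load_block_table('file.haps.bgz')
    handle = Bio.bgzf.open('file.haps.bgz')
    handle.seek(to_virtual_offset(table, pos))
    line = handle.readline()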

On 2014-04-09 09:54, Peter Cock wrote: 

> Hi Tommy,
> 
> This isn't covered in the tutorial, but the module's built in
> help is quite extensive (the docstrings). Try:
> 
> from Bio import bgzf
> help(bgzf)
> 
> Or, the HTML rendered version:
> http://biopython.org/DIST/docs/api/Bio.bgzf-module.html [3]
> 
> (Note to self - that could be made prettier by checking
> the markup works, rather than treating it as plain text)
> 
> Or, read the source on GitHub etc:
> https://github.com/biopython/biopython/blob/master/Bio/bgzf.py [4]
> 
> Essentially, as with any other Python handle, use the seek
> and tell methods - however the offsets are BGZF virtual
> offsets, which are ordered but you CANNOT do offset
> arithmetic on them. See also:
> http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html [5]
> 
> Peter
> 
> On Tue, Apr 8, 2014 at 10:24 PM, Tommy Carstensen <tc9 at sanger.ac.uk> wrote:
> 
>> I read the Biopython tutorial:
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html [1]
>> It does not explain how to do random access to a bgz file. Can
>> someone point me to a tutorial on how to do this? Thank you.
>>
>> Best wishes,
>> Tommy
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython [2]

 

Links:
------
[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html
[2] http://lists.open-bio.org/mailman/listinfo/biopython
[3] http://biopython.org/DIST/docs/api/Bio.bgzf-module.html
[4] https://github.com/biopython/biopython/blob/master/Bio/bgzf.py
[5] http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html




