[Biopython] random access to bgz file
tc9
tc9 at sanger.ac.uk
Wed Apr 9 17:35:23 UTC 2014
Peter, thanks for the link to the HTML version of the bgzf documentation.
Here are some additional details.
I am trying to do random access on a bgzipped haplotype/HAPS file. Here
is the file format description:
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample
I compressed the haps file with bgzip:
zcat file.haps.gz | bgzip > file.haps.bgz
I know the byte position of each newline after decompression, but I need
the block offsets to go from a decompressed position to a virtual
offset. Trying to get the block offsets like this fails:
import Bio
handle = Bio.bgzf.open('file.haps.bgz')
for values in Bio.bgzf.BgzfBlocks(handle):
    print("Raw start %i, raw length %i; data start %i, data length %i" % values)
I get this error message:
    for values in Bio.bgzf.BgzfBlocks(handle):
  File "/software/team149/lib/python3.3/site-packages/Bio/bgzf.py", line 392, in BgzfBlocks
    block_length, data = _load_bgzf_block(handle)
  File "/software/team149/lib/python3.3/site-packages/Bio/bgzf.py", line 407, in _load_bgzf_block
    % (_bgzf_magic, magic, handle.tell()))
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'1:10'; handle.tell() now says 4
How can I get the block offsets, so I can access a random byte/line of
my choice?
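A possible route, offered as a sketch and not a confirmed answer: the b'1:10' in the error is decompressed text, which suggests BgzfBlocks wants a raw handle from open('file.haps.bgz', 'rb') and not one from bgzf.open(). Once the (raw start, raw length, data start, data length) tuples are in a list, a decompressed position can be mapped to a virtual offset with a bisect lookup. The helper names pack_virtual_offset and virtual_offset_for below are hypothetical, and the block tuples in the demo are made up for illustration.

```python
from bisect import bisect_right


def pack_virtual_offset(block_start, within_block):
    """Pack a BGZF virtual offset (hypothetical helper).

    The compressed block start goes in the high 48 bits and the offset
    inside the decompressed block goes in the low 16 bits.
    """
    if not 0 <= within_block < 2 ** 16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2 ** 48:
        raise ValueError("block start must fit in 48 bits")
    return (block_start << 16) | within_block


def virtual_offset_for(position, blocks):
    """Map a decompressed byte position to a virtual offset (hypothetical helper).

    blocks is a list of (raw_start, raw_length, data_start, data_length)
    tuples sorted by data_start, the shape that BgzfBlocks prints above.
    """
    data_starts = [block[2] for block in blocks]
    # Last block whose decompressed start is <= position.
    index = bisect_right(data_starts, position) - 1
    if index < 0:
        raise ValueError("position is before the first block")
    raw_start, _raw_length, data_start, data_length = blocks[index]
    within_block = position - data_start
    if within_block >= data_length:
        raise ValueError("position is past the end of the data")
    return pack_virtual_offset(raw_start, within_block)


if __name__ == "__main__":
    # Made-up block table: two full blocks plus an empty end-of-file block.
    demo_blocks = [
        (0, 100, 0, 65280),
        (100, 120, 65280, 65280),
        (220, 28, 130560, 0),
    ]
    offset = virtual_offset_for(70000, demo_blocks)
    print("virtual offset:", offset)
    print("block start:", offset >> 16, "within block:", offset & 0xFFFF)
```

With Biopython the block list would presumably come from list(bgzf.BgzfBlocks(open('file.haps.bgz', 'rb'))), and the resulting number would be passed to seek() on a handle from bgzf.open('file.haps.bgz'). Bio.bgzf also provides make_virtual_offset, which could replace the packing helper here.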
On 2014-04-09 09:54, Peter Cock wrote:
> Hi Tommy,
>
> This isn't covered in the tutorial, but the module's built in
> help is quite extensive (the docstrings). Try:
>
> from Bio import bgzf
> help(bgzf)
>
> Or, the HTML rendered version:
> http://biopython.org/DIST/docs/api/Bio.bgzf-module.html [3]
>
> (Note to self - that could be made prettier by checking
> the markup works, rather than treating it as plain text)
>
> Or, read the source on GitHub etc:
> https://github.com/biopython/biopython/blob/master/Bio/bgzf.py [4]
>
> Essentially, like any other Python handle use the seek
> and tell methods - however the offsets are BGZF virtual
> offsets which are ordered but you CANNOT do offset
> arithmetic on them. See also:
> http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html [5]
>
> Peter
>
> On Tue, Apr 8, 2014 at 10:24 PM, Tommy Carstensen <tc9 at sanger.ac.uk> wrote:
>
>> I read the Biopython tutorial:
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html [1]
>> It does not explain how to do random access to a bgz file. Can
>> someone point me to a tutorial on how to do this? Thank you.
>>
>> Best wishes,
>> Tommy
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython [2]
Links:
------
[1] http://biopython.org/DIST/docs/tutorial/Tutorial.html
[2] http://lists.open-bio.org/mailman/listinfo/biopython
[3] http://biopython.org/DIST/docs/api/Bio.bgzf-module.html
[4] https://github.com/biopython/biopython/blob/master/Bio/bgzf.py
[5] http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html