[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Sat Jun 5 11:43:36 UTC 2010


On Sat, Jun 5, 2010 at 6:12 AM, Laurent <lgautier at gmail.com> wrote:
>>
>> I've been looking at that this afternoon as it is used in BAM files.
>> However, most gzip files (e.g. FASTA or FASTQ files) created with
>> the gzip command line tools will NOT follow the BGZF convention.
>> I personally have no need to have random access to gzipped general
>> sequence files files.
>>
>> However, I have some proof of concept code to exploit GZIP files using
>> the BGZF structure which should give more efficient random access to
>> any part of the file (compared to simply using the gzip module) but
>> haven't yet done any benchmarking. The code is still very immature,
>> but if you want a look see the _BgzfHandle class here:
>>
>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767
>
> Are you using that gzip obscure option that inserts "ticks" throughout the
> file ? If so, I remember reading that this could lead to problems (I just
> can't remember which ones... may be it can be found on the web).

I'm not sure what you are refering to - probably not. The way BGZF works
is it is a standard GZIP file, made up of multiple GZIP blocks which can
be decompressed in isolation (each with their own GZIP header). For
random access to any part of the file all you need is the block offset
(raw bytes, non-negative) and then a relative offset from the start of that
block after decompression (again, non-negative).

The non-standard bit is they use the optional subfields in the GZIP header
to record the block size - presumably this cannot be infered any other
way. This information gives you the block offsets which are used when
constructing the index.

>>> More generally, compression is part of the HDF5 format and this with
>>> chunks could prove the most battle-tested way to access entries
>>> randomly.
>>
>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>
> Last year, in a SIG at the ISMB in Stockholm people showed that they have
> stored next-gen/short-reads using HDF5, and have demonstrated superior
> performances to BAM (not completely a surprise since to some BAM is
> reinventing some of the features in HDF5, and HDF5 has been developed for a
> longer time). I think that their slides are on slideshare (or similar
> place).

There is some talk on the samtools mailing list about general improvements
to the chunking in BAM, relocating the header information (and other very
read specific things about representing error models, indels, etc). You may
be right that HDF5 has technical advantages over BAM version 1, but currently
my impression is that SAM/BAM is making good headway with becoming
a defacto standard for next generation data, while HDF5 is not. Maybe
someone should suggest they move to HDF5 internally for BAM version 2?

Peter



More information about the Biopython mailing list