[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Sat Jun 5 08:52:25 EDT 2010

On Jun 5, 2010, at 6:43 AM, Peter wrote:

> On Sat, Jun 5, 2010 at 6:12 AM, Laurent <lgautier at gmail.com> wrote:
>>> 
>>> I've been looking at that this afternoon as it is used in BAM files.
>>> However, most gzip files (e.g. FASTA or FASTQ files) created with
>>> the gzip command line tools will NOT follow the BGZF convention.
>>> I personally have no need to have random access to gzipped general
>>> sequence files files.
>>> 
>>> However, I have some proof of concept code to exploit GZIP files using
>>> the BGZF structure which should give more efficient random access to
>>> any part of the file (compared to simply using the gzip module) but
>>> haven't yet done any benchmarking. The code is still very immature,
>>> but if you want a look see the _BgzfHandle class here:
>>> 
>>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767
>> 
>> Are you using that gzip obscure option that inserts "ticks" throughout the
>> file ? If so, I remember reading that this could lead to problems (I just
>> can't remember which ones... may be it can be found on the web).
> 
> I'm not sure what you are refering to - probably not. The way BGZF works
> is it is a standard GZIP file, made up of multiple GZIP blocks which can
> be decompressed in isolation (each with their own GZIP header). For
> random access to any part of the file all you need is the block offset
> (raw bytes, non-negative) and then a relative offset from the start of that
> block after decompression (again, non-negative).
> 
> The non-standard bit is they use the optional subfields in the GZIP header
> to record the block size - presumably this cannot be infered any other
> way. This information gives you the block offsets which are used when
> constructing the index.
> 
>>>> More generally, compression is part of the HDF5 format and this with
>>>> chunks could prove the most battle-tested way to access entries
>>>> randomly.
>>> 
>>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>> 
>> Last year, in a SIG at the ISMB in Stockholm people showed that they have
>> stored next-gen/short-reads using HDF5, and have demonstrated superior
>> performances to BAM (not completely a surprise since to some BAM is
>> reinventing some of the features in HDF5, and HDF5 has been developed for a
>> longer time). I think that their slides are on slideshare (or similar
>> place).
> 
> There is some talk on the samtools mailing list about general improvements
> to the chunking in BAM, relocating the header information (and other very
> read specific things about representing error models, indels, etc). You may
> be right that HDF5 has technical advantages over BAM version 1, but currently
> my impression is that SAM/BAM is making good headway with becoming
> a defacto standard for next generation data, while HDF5 is not. Maybe
> someone should suggest they move to HDF5 internally for BAM version 2?
> 
> Peter

I have run into a few people (primarily those interested in mapping reads to genome seq) that have pointed out some problems with SAM/BAM, particularly the lack of more definitive, clear-cut definitions for regions of non-matching sequences (possibly due to many reasons, such as splice junctions, etc).  Haven't actually looked at the SAM/BAM spec myself to see how correct this point is, but there are others either rolling their own solutions or threatening to.

chris