[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Laurent lgautier at gmail.com
Sat Jun 5 12:06:00 UTC 2010


On 05/06/10 13:43, Peter wrote:
> On Sat, Jun 5, 2010 at 6:12 AM, Laurent<lgautier at gmail.com>  wrote:
>>>
>>> I've been looking at that this afternoon as it is used in BAM files.
>>> However, most gzip files (e.g. FASTA or FASTQ files) created with
>>> the gzip command line tools will NOT follow the BGZF convention.
>>> I personally have no need to have random access to gzipped general
>>> sequence files files.
>>>
>>> However, I have some proof of concept code to exploit GZIP files using
>>> the BGZF structure which should give more efficient random access to
>>> any part of the file (compared to simply using the gzip module) but
>>> haven't yet done any benchmarking. The code is still very immature,
>>> but if you want a look see the _BgzfHandle class here:
>>>
>>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767
>>
>> Are you using that gzip obscure option that inserts "ticks" throughout the
>> file ? If so, I remember reading that this could lead to problems (I just
>> can't remember which ones... may be it can be found on the web).
>
> I'm not sure what you are refering to - probably not. The way BGZF works
> is it is a standard GZIP file, made up of multiple GZIP blocks which can
> be decompressed in isolation (each with their own GZIP header). For
> random access to any part of the file all you need is the block offset
> (raw bytes, non-negative) and then a relative offset from the start of that
> block after decompression (again, non-negative).
>
> The non-standard bit is they use the optional subfields in the GZIP header
> to record the block size - presumably this cannot be infered any other
> way. This information gives you the block offsets which are used when
> constructing the index.
>
>>>> More generally, compression is part of the HDF5 format and this with
>>>> chunks could prove the most battle-tested way to access entries
>>>> randomly.
>>>
>>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>>
>> Last year, in a SIG at the ISMB in Stockholm people showed that they have
>> stored next-gen/short-reads using HDF5, and have demonstrated superior
>> performances to BAM (not completely a surprise since to some BAM is
>> reinventing some of the features in HDF5, and HDF5 has been developed for a
>> longer time). I think that their slides are on slideshare (or similar
>> place).
>
> There is some talk on the samtools mailing list about general improvements
> to the chunking in BAM, relocating the header information (and other very
> read specific things about representing error models, indels, etc). You may
> be right that HDF5 has technical advantages over BAM version 1, but currently
> my impression is that SAM/BAM is making good headway with becoming
> a defacto standard for next generation data, while HDF5 is not. Maybe
> someone should suggest they move to HDF5 internally for BAM version 2?

De-facto standards happen to become so because more people use them at 
some point (which may involve step during which a lot of people 
/believe/ that most of the people are using a format over an other ;-) 
), but this is indeed not necessarily making them the best technical 
solutions.

I do believe that building on HDF5 is a better approach:
- better use of resources (do not reinvent completely what is already 
existing unless better)
- HDF5 is designed as a rather general storage architecture, and will 
let one build tailored solutions when needed.

I'd be surprised the BAM/SAM do not know about HDF formats, but I do not 
know for sure. Is there any BAM/SAM person reading ?



Laurent



> Peter




More information about the Biopython mailing list