[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Sat Jun 5 05:12:57 UTC 2010

On 04/06/10 21:04, Peter wrote:
> On Fri, Jun 4, 2010 at 7:25 PM, Laurent<lgautier at gmail.com>  wrote:
>>
>> One note of caution: Python's gzip module is slow, or so I experienced... to
>> the point that I ended up wrapping the code into a function that gunzips
>> the file to a temporary location, parses it and extracts the information,
>> then deletes the temporary file.
>>
>
> That should be easy to benchmark - using Python's gzip to parse a file
> versus using the command line tool gzip to decompress and then parse
> the uncompressed file.
>
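
A rough sketch of such a benchmark, offered only as a starting point. The
function names are made up for this example, counting ">" lines is a
stand-in for a real parser such as Bio.SeqIO, and the synthetic FASTA file
is an assumption; real timings depend on file size, disk and Python version.

```python
import gzip
import os
import shutil
import subprocess
import tempfile
import time


def count_fasta_records(handle):
    """Stand-in for a real parser: count lines starting with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def count_with_python_gzip(gz_path):
    # Decompress on the fly with Python's gzip module (text mode).
    with gzip.open(gz_path, "rt") as handle:
        return count_fasta_records(handle)


def count_with_external_gzip(gz_path):
    # Decompress to a temporary file with the gzip command line tool,
    # parse the plain file, then delete the temporary file.
    if shutil.which("gzip") is None:
        raise RuntimeError("gzip command line tool not found on PATH")
    fd, tmp_path = tempfile.mkstemp(suffix=".fasta")
    try:
        with os.fdopen(fd, "wb") as out:
            subprocess.run(["gzip", "-dc", gz_path], stdout=out, check=True)
        with open(tmp_path) as handle:
            return count_fasta_records(handle)
    finally:
        os.remove(tmp_path)


def timed(func, *args):
    # Return the function result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def make_example(n_records=20000):
    # Write a synthetic gzipped FASTA file and return its path.
    fd, gz_path = tempfile.mkstemp(suffix=".fasta.gz")
    os.close(fd)
    with gzip.open(gz_path, "wt") as handle:
        for i in range(n_records):
            handle.write(">seq%i\nACGTACGTACGTACGTACGT\n" % i)
    return gz_path


if __name__ == "__main__":
    path = make_example()
    try:
        for func in (count_with_python_gzip, count_with_external_gzip):
            try:
                count, seconds = timed(func, path)
            except RuntimeError as err:
                print("%s skipped: %s" % (func.__name__, err))
                continue
            print("%s: %i records in %.3fs" % (func.__name__, count, seconds))
    finally:
        os.remove(path)
```

Both routes should report the same record count; only the timings are of
interest, and they are best compared on a realistically large file.
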
>>
>> Regarding random access in compressed files, there is the BGZF format but I
>> am not familiar enough with it to tell whether it can be of use here.
>>
>
> I've been looking at that this afternoon as it is used in BAM files. However,
> most gzip files (e.g. FASTA or FASTQ files) created with the gzip command
> line tools will NOT follow the BGZF convention. I personally have no need
> to have random access to gzipped general sequence files.
>
> However, I have some proof of concept code to exploit GZIP files using the
> BGZF structure which should give more efficient random access to any part
> of the file (compared to simply using the gzip module) but haven't yet done
> any benchmarking. The code is still very immature, but if you want a look
> see the _BgzfHandle class here:
>
> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767
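
To illustrate the difference between an ordinary gzip file and one following
the BGZF convention, here is a minimal sketch that inspects the header of the
first gzip member. It assumes the layout described in the SAM/BAM
specification (FEXTRA flag set, plus an extra subfield with identifier "BC"
and length 2). The name is_bgzf is made up for this example and is not taken
from the _BgzfHandle code linked above.

```python
import gzip
import io
import struct


def is_bgzf(handle):
    """Return True if the first gzip member carries a BGZF 'BC' subfield."""
    # Fixed gzip header (10 bytes) followed by the 2-byte XLEN field,
    # which is only meaningful when the FEXTRA flag is set.
    header = handle.read(12)
    if len(header) < 12:
        return False
    magic1, magic2, method, flags, _mtime, _xfl, _os, xlen = struct.unpack(
        "<BBBBIBBH", header
    )
    if (magic1, magic2, method) != (0x1F, 0x8B, 8):
        return False  # not a deflate-compressed gzip member
    if not flags & 0x04:
        return False  # FEXTRA not set, so no extra subfields at all
    extra = handle.read(xlen)
    pos = 0
    # Walk the extra subfields: SI1, SI2, SLEN, then SLEN bytes of payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


# The 28-byte empty block used as the BGZF end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

if __name__ == "__main__":
    plain = gzip.compress(b">seq1\nACGT\n")
    print("plain gzip:", is_bgzf(io.BytesIO(plain)))
    print("BGZF block:", is_bgzf(io.BytesIO(BGZF_EOF)))
```

Only the first member is examined, so this is a quick heuristic rather than
a validation of the whole file.
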

Are you using that obscure gzip option that inserts "ticks" throughout
the file? If so, I remember reading that this could lead to problems (I
just can't remember which ones... maybe it can be found on the web).

>>
>> More generally, compression is part of the HDF5 format, and combined with
>> chunked storage this could prove the most battle-tested way to access
>> entries randomly.
>>
>
> But (thus far) no sequence data is stored in HDF5 format (is it?).

Last year, in a SIG at the ISMB in Stockholm, people showed that they
have stored next-gen/short reads using HDF5, and demonstrated
superior performance to BAM (not completely a surprise, since to some
extent BAM is reinventing some of the features in HDF5, and HDF5 has been
developed for a longer time). I think that their slides are on
SlideShare (or a similar place).

Laurent

> Peter