[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Fri Jun 4 19:04:16 UTC 2010


On Fri, Jun 4, 2010 at 7:25 PM, Laurent <lgautier at gmail.com> wrote:
>
> One note of caution: Python's gzip module is slow, or so I found, to
> the point that I ended up wrapping the code in a function that gunzips
> the file to a temporary location, parses it and extracts the information,
> then deletes the temporary file.
>

That should be easy to benchmark - using Python's gzip to parse a file
versus using the command line tool gzip to decompress and then parse
the uncompressed file.
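
A minimal sketch of such a benchmark follows. It assumes the gzip command
line tool is on the PATH, and it uses a trivial FASTA header count as a
stand-in for real parsing. The function names are made up for illustration.

```python
import gzip
import os
import subprocess
import tempfile
import time


def count_records(handle):
    """Stand-in for a real parser: count FASTA header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def parse_with_gzip_module(gz_path):
    """Read the compressed file directly through Python's gzip module."""
    with gzip.open(gz_path, "rt") as handle:
        return count_records(handle)


def parse_after_command_line_gunzip(gz_path):
    """Decompress with the gzip command line tool, parse, then clean up."""
    fd, tmp_path = tempfile.mkstemp(suffix=".fasta")
    try:
        with os.fdopen(fd, "wb") as out:
            # "gzip -dc" writes the decompressed data to stdout.
            subprocess.run(["gzip", "-dc", gz_path], stdout=out, check=True)
        with open(tmp_path) as handle:
            return count_records(handle)
    finally:
        os.remove(tmp_path)


def benchmark(gz_path):
    """Return {strategy name: (record count, seconds)} for both strategies."""
    results = {}
    for func in (parse_with_gzip_module, parse_after_command_line_gunzip):
        start = time.perf_counter()
        count = func(gz_path)
        results[func.__name__] = (count, time.perf_counter() - start)
    return results
```

Calling benchmark("example.fasta.gz") on a reasonably large file (the
filename here is only a placeholder) and repeating the run a few times
should show whether the difference matters for a given data set.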

>
> Regarding random access in compressed files, there is the BGZF format, but I
> am not familiar enough with it to tell whether it can be of use here.
>

I've been looking at that this afternoon as it is used in BAM files. However,
most gzip files (e.g. FASTA or FASTQ files) created with the gzip command
line tools will NOT follow the BGZF convention. I personally have no need
to have random access to general gzipped sequence files.

However, I have some proof of concept code to exploit GZIP files using the
BGZF structure which should give more efficient random access to any part
of the file (compared to simply using the gzip module) but haven't yet done
any benchmarking. The code is still very immature, but if you want a look
see the _BgzfHandle class here:

http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767
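
To illustrate why a block structure helps, here is a rough sketch of the
underlying idea and not the code in the commit above. The data is written
as a series of small, independent gzip members, and an index of block
offsets lets a reader decompress only the blocks it needs. Real BGZF goes
further and records each block's size in a gzip extra field. The in-memory
index and all function names here are assumptions made for illustration.

```python
import bisect
import gzip
import io

BLOCK_SIZE = 65536  # illustrative; BGZF also caps blocks at 64 KB


def write_blocked_gzip(data, block_size=BLOCK_SIZE):
    """Compress data as concatenated gzip members; return (blob, index).

    Each index entry is (compressed start offset, uncompressed start offset).
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        index.append((out.tell(), start))
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), index


def read_at(blob, index, offset, length):
    """Return data[offset:offset + length], decompressing only needed blocks."""
    if not index:
        return b""
    starts = [uncompressed for _, uncompressed in index]
    # Find the block that contains the requested uncompressed offset.
    i = bisect.bisect_right(starts, offset) - 1
    skip = offset - starts[i]
    needed = length
    pieces = []
    while needed > 0 and i < len(index):
        c_start = index[i][0]
        c_end = index[i + 1][0] if i + 1 < len(index) else len(blob)
        # Each block is a complete gzip member, so it decompresses alone.
        block = gzip.decompress(blob[c_start:c_end])
        piece = block[skip:skip + needed]
        pieces.append(piece)
        needed -= len(piece)
        skip = 0
        i += 1
    return b"".join(pieces)
```

The concatenated output is still a valid gzip stream, so ordinary gzip
tools can read it from start to finish. With a plain single-member gzip
file, by contrast, reaching a given offset means decompressing everything
before it.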

>
> More generally, compression is part of the HDF5 format and this with chunks
> could prove the most battle-tested way to access entries randomly.
>

But (thus far) no sequence data is stored in HDF5 format (is it?).

Peter


