[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Thu Jun 3 13:52:31 EDT 2010

Dear Biopythoneers,

We've had several discussions (mostly on the development list) about
extending the Bio.SeqIO.index() functionality. For a quick recap of what
you can do with it right now, see:

http://news.open-bio.org/news/2009/09/biopython-seqio-index/
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/

There are two major additions that have been discussed (and some
code written too): gzip support and storing the index on disk.

Currently Bio.SeqIO.index() has to re-index the sequence file each
time you run your script. If you run the same script often, it would be
useful to be able to save the index information to disk.  The idea is
that you can then load the index file and get almost immediate
random access to the associated sequence file (without waiting to scan
the file to rebuild the index). The old OBDA style indexes used by
BioPerl, BioRuby etc. are one possible file format we might use, but
a simple SQLite database may be preferable. This would also give
us a way to index really big files with many millions of reads without
keeping the file offsets in memory. This is going to be important for
random access to the latest massive sequencing data files.
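
To make the on-disk idea concrete, here is a rough sketch of an SQLite
offset index for a FASTA file. It uses only the Python standard library
and is not the Bio.SeqIO.index() implementation. The table layout and
the function names (scan_offsets, build_index, fetch_raw) are
assumptions made up for illustration, and it assumes every record has
a unique, non-empty identifier.

```python
import os
import sqlite3
import tempfile


def scan_offsets(handle):
    """Yield (record_id, start, length) for each FASTA record.

    The handle must be opened in binary mode so offsets are in bytes.
    """
    rec_id, start = None, 0
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line or line.startswith(b">"):
            if rec_id is not None:
                yield rec_id, start, pos - start
            if not line:
                return
            # Identifier is the first word after ">" (assumed non-empty).
            rec_id = line[1:].split(None, 1)[0].decode()
            start = pos


def build_index(fasta_path, db_path):
    """Scan the sequence file once and store the offsets on disk."""
    con = sqlite3.connect(db_path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS offsets "
            "(id TEXT PRIMARY KEY, start INTEGER, length INTEGER)"
        )
        con.execute("DELETE FROM offsets")
        with open(fasta_path, "rb") as handle:
            # Rows are streamed from the generator rather than held in a list.
            con.executemany(
                "INSERT INTO offsets VALUES (?, ?, ?)", scan_offsets(handle)
            )
        con.commit()
    finally:
        con.close()


def fetch_raw(fasta_path, db_path, rec_id):
    """Return the raw text of one record using the saved index."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT start, length FROM offsets WHERE id = ?", (rec_id,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise KeyError(rec_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        return handle.read(row[1]).decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "example.fasta")
        with open(fasta, "wb") as out:
            out.write(b">alpha first\nACGT\nTTGA\n>beta\nGGCC\n")
        db = fasta + ".idx"
        build_index(fasta, db)          # done once
        print(fetch_raw(fasta, db, "beta"))  # later runs only need this
```

A later run of the script would skip build_index() and call fetch_raw()
directly, so the lookup cost is one SQLite query plus one seek, and the
offsets never need to be held in memory all at once.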

Next is support for indexing compressed files (initially concentrating
on Unix style gzipped files, e.g. example.fasta.gz) without having
to decompress the whole file. You can already parse these files
with Bio.SeqIO in conjunction with the Python gzip module. It would
be nice to be able to index them too.
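
As a rough illustration of what indexing a gzipped file involves, the
sketch below uses only the Python gzip module (not Bio.SeqIO) to record
where each FASTA record starts in the decompressed stream and then seek
back to one. The function names are made up for this example. Note the
catch: a seek() on a gzip handle is emulated by decompressing and
discarding everything before the target offset, so this naive approach
does not avoid decompressing the data that precedes the record.

```python
import gzip


def index_gzipped_fasta(path):
    """Map record id -> start offset in the *decompressed* stream."""
    offsets = {}
    with gzip.open(path, "rb") as handle:
        while True:
            pos = handle.tell()  # position in decompressed data
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Identifier is the first word after ">" (assumed non-empty).
                offsets[line[1:].split(None, 1)[0].decode()] = pos
    return offsets


def fetch_gzipped(path, offset):
    """Return the raw text of the record starting at the given offset."""
    with gzip.open(path, "rb") as handle:
        # Emulated seek: data before the offset is decompressed and dropped.
        handle.seek(offset)
        lines = [handle.readline()]
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode()
```

For a file example.fasta.gz, index_gzipped_fasta("example.fasta.gz")
would give the offsets, and fetch_gzipped() with one of those offsets
would return that record. Getting true random access would need
something more, such as also recording positions in the compressed
data at which decompression can be restarted.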

Now ideally we'd be able to offer both of these features - but if
you had to vote, which would be more important and why?

Peter