[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Mon Jun 7 09:49:17 UTC 2010


On Mon, Jun 7, 2010 at 10:10 AM, Ernesto <e.picardi at unical.it> wrote:
> Hi all,
>
> I followed the interesting discussion about indexing. I think that
> it is a hot point given the huge amount of data released by the
> new sequencing technologies.

Yes - although the discussion has gone beyond just indexing
to also cover storing data.

> I never used the Bio.SeqIO.index() but I'd like to test it and I'd
> like also to know how to use it. Is there a simple tutorial?

Bio.SeqIO.index() is included in Biopython 1.52 onwards. It is
covered in the main Biopython Tutorial:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
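
In short, usage looks something like this (just a minimal sketch -
the FASTQ filename and record id below are placeholders):

    from Bio import SeqIO

    # Build a dictionary-like index of record ids to file offsets;
    # records are only parsed from disk when you access them.
    reads = SeqIO.index("SRR001666.fastq", "fastq")
    print(len(reads))              # how many records were indexed
    record = reads["SRR001666.1"]  # pulled from the file on demand
    print(record.seq)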

There are also a few blog posts about it (linked to at the start
of this thread):

http://news.open-bio.org/news/2009/09/biopython-seqio-index/
http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/

> In the past I tried pytables based on HDF5 library and I was
> impressed by its high speed. However, the indexing is not supported
> at least for the free version.

Yes, Brad wrote about this frustration earlier.

> Moreover, strings of variable length cannot easily be handled and
> stored. For example, in order to store EST sequences you need to
> know a priori the maximum length in order to optimize the storage.
> As an alternative, VLAs (variable length arrays) could be used but
> the storing performance goes down quickly.

The BioHDF project has probably thought about this kind of issue.

However, for Bio.SeqIO.index() we don't store the sequences in a
database - just the associated file offsets.

The current Bio.SeqIO.index() code works by scanning a
sequence file and storing a lookup table of record identifiers and
file offsets in a Python dictionary. This works very well, but once
you get into tens of millions of records the memory requirements
become a problem - for instance, you may need to run a 64-bit
Python just to have more than 4GB of RAM available for the index.
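
In outline the approach is something like this (a simplified
FASTA-only sketch, not the actual implementation):

    def index_fasta_offsets(filename):
        # Scan the file once, remembering the byte offset of each ">" line
        offsets = {}
        handle = open(filename)
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                record_id = line[1:].split(None, 1)[0]
                offsets[record_id] = offset
        handle.close()
        return offsets

Fetching a record is then just a seek() to the stored offset,
followed by parsing a single record from that point.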

Also, for very large files the time taken to build the index grows,
so having to re-index the file on every run becomes an issue.
Saving the index to disk solves this, and also lets us avoid
keeping the whole lookup table in memory.

> Few days ago I tried to store millions of data using SQLite and I
> found it very slow, although my code is not optimized (I'm not a
> computer scientist but a biologist who likes Python and Biopython).

If you search the Biopython development mailing list you'll see we've
already done some work using SQLite to store the file offsets. There
is an experimental branch on github here if you are curious BUT this
is not ready for production use:

http://github.com/peterjc/biopython/tree/index-sqlite
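
Just to illustrate the idea (this is not the schema or API used on
that branch, simply a sketch with the standard library's sqlite3
module):

    import sqlite3

    def save_offsets(index_filename, offsets):
        # Write the id -> offset lookup table to an SQLite file on disk
        con = sqlite3.connect(index_filename)
        con.execute("CREATE TABLE IF NOT EXISTS offset_data "
                    "(key TEXT PRIMARY KEY, offset INTEGER)")
        con.executemany("INSERT INTO offset_data VALUES (?, ?)",
                        offsets.items())
        con.commit()
        con.close()

    def lookup_offset(index_filename, key):
        # Fetch a single offset without loading the whole table into memory
        con = sqlite3.connect(index_filename)
        row = con.execute("SELECT offset FROM offset_data WHERE key = ?",
                          (key,)).fetchone()
        con.close()
        return None if row is None else row[0]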

> However, as an alternative, I found the tokyocabinet library
> (http://1978th.net/tokyocabinet/) that is a modern implementation (in C)
> of DBM. There are a lot of python wrappers like tokyocabinet-python
> 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/) that work
> efficiently and guarantee high speed and compression. Tokyocabinet
> implements hash databases, B-tree databases, table databases
> giving also the possibility to store info on disk or on memory. In
> case of table databases it should be able to index specific columns.

Tokyocabinet is certainly an interesting project, but storing the
records themselves isn't the problem that Bio.SeqIO.index() is
trying to solve. You might be interested in
Brad's blog post from last year:

http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/

Regards,

Peter
