[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Ernesto e.picardi at unical.it
Mon Jun 7 09:10:26 UTC 2010


Hi all,

I followed the interesting discussion about indexing. I think it is an important topic given the huge amount of data produced by the new sequencing technologies. I have never used Bio.SeqIO.index(), but I'd like to test it and learn how to use it. Is there a simple tutorial?
In the past I tried PyTables, which is based on the HDF5 library, and I was impressed by its speed. However, indexing is not supported, at least in the free version. Moreover, variable-length strings cannot easily be handled and stored. For example, to store EST sequences you need to know the maximum length a priori in order to optimize the storage. As an alternative, VLAs (variable-length arrays) can be used, but storage performance drops quickly.
A few days ago I tried to store millions of entries using SQLite and found it very slow, although my code is not optimized (I'm not a computer scientist but a biologist who likes Python and Biopython). As an alternative, I found the Tokyo Cabinet library (http://1978th.net/tokyocabinet/), which is a modern implementation (in C) of DBM. There are several Python wrappers, such as tokyocabinet-python 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/), that work efficiently and offer high speed and compression. Tokyo Cabinet implements hash databases, B-tree databases, and table databases, and can store data either on disk or in memory. For table databases, it should also be able to index specific columns.
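On the SQLite side, one common cause of slow bulk loading is committing after every single insert. Below is a minimal sketch of loading everything in one transaction, using only the sqlite3 module from the Python standard library. The table name "seqs", its two columns, and the function names are hypothetical choices for illustration, not anything taken from Biopython or from the github branch mentioned in this thread, and no timings are claimed for it.

```python
import sqlite3


def store_sequences(db_path, records):
    """Insert (identifier, sequence) pairs using a single transaction.

    `records` is any iterable of 2-tuples; the table layout is an
    assumption made for this example only.
    """
    conn = sqlite3.connect(db_path)
    try:
        # "with conn" commits once at the end, or rolls back on error,
        # instead of committing after every row.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS seqs "
                "(id TEXT PRIMARY KEY, seq TEXT)"
            )
            conn.executemany(
                "INSERT INTO seqs (id, seq) VALUES (?, ?)", records
            )
    finally:
        conn.close()


def fetch_sequence(db_path, seq_id):
    """Return the sequence stored under seq_id, or None if it is absent."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT seq FROM seqs WHERE id = ?", (seq_id,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```

Declaring the identifier as PRIMARY KEY makes SQLite maintain an index on it, so lookups by identifier do not scan the whole table. Variable-length sequences need no special handling here, since a TEXT column has no fixed width.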

Hope this helps,

Ernesto 


On 07 Jun 2010, at 10:01, Renato Alves wrote:

> Quoting Peter on 06/04/2010 09:42 AM:
>> Unfortunately I was inconsistent about which order I used in my email
>> (gzip vs on disk indexes) so I'm not sure which you are talking about.
>> Are you saying supporting on disk indexes would be your priority (even
>> though you did ask about gzip support in the past)?
> 
> Yes, exactly. The gzip support became a non-priority, at least for our
> current local uses. On the other hand, disk support would be quite helpful.
> As a matter of fact we borrowed a little of your SeqIO.index() sqlite
> code you have on a github branch.
> 
> Renato.
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython




