[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Brad Chapman chapmanb at 50mail.com
Sat Jun 5 12:51:08 UTC 2010


Laurent and Peter;

> I do believe that building on HDF5 is a better approach:
> - better use of resources (do not reinvent completely what is
> already existing unless better)
> - HDF5 is designed as a rather general storage architecture, and
> will let one build tailored solutions when needed.

HDF5 does have lots of good technical points, although as Peter mentions
the lack of community uptake is a concern. To potentially explain this,
here is my personal HDF5 usage story: I took an in-depth look at PyTables
for some large data sets that were overwhelming SQLite:

http://www.pytables.org/moin

The data loaded quickly without any issues, but the most basic thing
I needed was an index to retrieve a subset of the data by chromosome
and position. Unfortunately, you can't create indexes without
buying the Pro edition:

http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex

That immediately killed my ability to share the script, so I ended
my HDF5 experiment and reworked my SQLite approach.
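
For illustration, here is a minimal sketch of the kind of chromosome
and position lookup described above, using Python's standard sqlite3
module. This is not the original script: the table name (features),
the column names (chrom, pos, value) and the two helper functions are
hypothetical, and an in-memory database is assumed for simplicity.

```python
import sqlite3


def build_db(rows):
    """Load (chrom, pos, value) rows into SQLite and index them.

    Table and column names are hypothetical, chosen for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features (chrom TEXT, pos INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    # Composite index so region queries do not scan the whole table.
    conn.execute("CREATE INDEX idx_chrom_pos ON features (chrom, pos)")
    conn.commit()
    return conn


def fetch_region(conn, chrom, start, end):
    """Return rows on `chrom` with start <= pos <= end, sorted by pos."""
    cur = conn.execute(
        "SELECT chrom, pos, value FROM features "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()
```

With an index on (chrom, pos), SQLite can generally answer the
equality-plus-range query from the index rather than a full scan,
which is the basic capability the paragraph above refers to.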

Also, echoing Peter, the BioHDF download warns you that the code is
not stable, tested, or supported:

http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html

BAM is widely used and has tools that are meant to work on it
in production environments now, while HDF tool support still feels
experimental. Sometimes it is best to be practical and keep an eye
on other technical solutions as they evolve.

Brad


