[BioPython] [Popgen] a binary format for genotypes
Kevin Teague
kteague at bcgsc.ca
Mon Dec 15 17:53:29 EST 2008
A lot of the headaches of dealing with large-scale data sets in a
performance-conscious manner (self-describing formats,
platform-independent binary files) have been worked out in other
fields of science that have been dealing with large-scale data sets
for a lot longer than bioinformatics has (e.g. astronomy and
climatology).
I've only used it a little, so I can't say whether any other formats
are worthy contenders, but the HDF5 format is well established for
working with large-scale data sets:
http://www.hdfgroup.org/HDF5/
There are libraries for accessing this format from many languages. For
Python there is PyTables, which is a very good library:
http://www.pytables.org/
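As a rough idea of what storing genotypes this way might look like,
here is a minimal sketch. The table layout (integer individual and
locus IDs plus two allele codes) and the function names are my own
assumptions for illustration, not an established genotype schema. It
uses the snake_case function names from PyTables 3.x; older releases
spelled these openFile and createTable.

```python
import os
import tempfile

import tables


class Genotype(tables.IsDescription):
    # Hypothetical schema: one row per (individual, locus) call.
    individual = tables.Int32Col()  # numeric sample ID
    locus = tables.Int32Col()       # marker index
    allele1 = tables.Int8Col()      # first allele code
    allele2 = tables.Int8Col()      # second allele code


def write_genotypes(path, records):
    """Write (individual, locus, allele1, allele2) tuples to an HDF5 file."""
    with tables.open_file(path, mode="w", title="genotype calls") as h5:
        table = h5.create_table(h5.root, "genotypes", Genotype)
        row = table.row
        for individual, locus, a1, a2 in records:
            row["individual"] = individual
            row["locus"] = locus
            row["allele1"] = a1
            row["allele2"] = a2
            row.append()
        table.flush()


def heterozygous_loci(path, individual):
    """Return the loci at which the given individual is heterozygous."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.genotypes
        # The condition string is evaluated inside PyTables, so only
        # matching rows are handed back to Python.
        hits = table.where(
            "(individual == target) & (allele1 != allele2)",
            condvars={"target": individual},
        )
        return [int(r["locus"]) for r in hits]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    write_genotypes(demo_path, [(1, 10, 0, 1), (1, 11, 2, 2)])
    print(heterozygous_loci(demo_path, 1))
```

The resulting file carries its own column names and types, so any
other HDF5 library on any other platform can read it back.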
I haven't heard of anyone using this in bioinformatics, but I've seen
it demonstrated in a very high-traffic financial application written
in Python, where the performance of this library was impressive. The
developer ported to PyTables after PostgreSQL became a bottleneck and
found that PyTables was an order of magnitude faster. Of course, this
isn't an entirely fair comparison, since PyTables gives up
transactions, concurrency and referential integrity in favor of pure
speed. But in most data analysis pipelines, each data set can be
produced independently of the others, so those features of an RDBMS
aren't usually needed.
A number of other bioinformatics tools and libraries use custom
binary file formats to deal with the ever-increasing size of
bioinformatics data sets. From a sysadmin and developer perspective
it's a big headache, since these custom formats can be
platform-sensitive and require compiling and installing binaries to
deal with each data format. Bleh!
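To make the platform-sensitivity point concrete: a hand-rolled format
has to pin down byte order and field widths itself, or files written
on one machine may not read back on another. A minimal sketch using
only the standard library struct module follows. The "GTYP" signature
and the record layout are hypothetical, invented purely for
illustration.

```python
import struct

MAGIC = b"GTYP"      # hypothetical 4-byte file signature
HEADER = "<4sI"      # "<" forces little-endian, standard sizes, no padding
RECORD = "<BB"       # two unsigned bytes: allele1, allele2


def pack_genotypes(calls):
    """Encode (allele1, allele2) pairs with an explicit byte layout."""
    header = struct.pack(HEADER, MAGIC, len(calls))
    body = b"".join(struct.pack(RECORD, a1, a2) for a1, a2 in calls)
    return header + body


def unpack_genotypes(blob):
    """Decode a blob produced by pack_genotypes."""
    magic, count = struct.unpack_from(HEADER, blob, 0)
    if magic != MAGIC:
        raise ValueError("not a genotype blob")
    offset = struct.calcsize(HEADER)
    size = struct.calcsize(RECORD)
    return [
        struct.unpack_from(RECORD, blob, offset + i * size)
        for i in range(count)
    ]


# The same integer serializes differently depending on byte order,
# which is why a format that relies on the native order is fragile.
little = struct.pack("<I", 1)
big = struct.pack(">I", 1)
```

Every custom format has to get details like these right, and document
them, on its own. That is work a format like HDF5 has already done.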
I have yet to see a "custom bioinformatic binary file format" which
had to be developed to account for shortcomings of an already
existing binary file format ...