[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Sat Jun 5 14:07:00 UTC 2010

On 05/06/10 15:02, biopython-request at lists.open-bio.org wrote:
> Laurent and Peter;
>
>> I do believe that building on HDF5 is a better approach:
>> - better use of resources (do not reinvent completely what is
>> already existing unless better)
>> - HDF5 is designed as a rather general storage architecture, and
>> will let one build tailored solutions when needed.
>
> HDF5 does have lots of good technical points, although as Peter mentions
> the lack of community uptake is a concern. To potentially explain this,
> here is my personal HDF5 usage story: I took an in-depth look at PyTables
> for some large data sets that were overwhelming SQLite:
>
> http://www.pytables.org/moin
>
> The data loaded quickly without any issues, but the most basic thing
> I needed was indexes to retrieve a subset of the data by chromosome
> and position. Unfortunately, you can't create indexes without
> buying the Pro edition:
>
> http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
>
> That immediately killed my ability to share the script so I ended
> my HDF5 experiment and reworked my SQLite approach.

PyTables is already a dialect of HDF5 (not necessarily readable by other
HDF5 software/libraries), and the "Pro Edition" adds indexing
capabilities, I think.

h5py is the alternative.

Also, indexing (as in "hash function + tree") can be done using SQLite, 
and both (HDF5 and SQLite) can complement each other very efficiently.
[I have designed and implemented ad-hoc hybrid solutions on several
occasions, and have not regretted it so far]
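
A minimal sketch of such a hybrid layout, under stated assumptions: SQLite
holds only a (chromosome, position) index that maps to row numbers, and the
bulk values live in a separate store. The table and column names, the
sample records, and the rows_in_region() helper are all hypothetical, and a
plain Python list stands in for the HDF5 dataset so that the sketch runs
with the standard library alone.

```python
import sqlite3

# Hypothetical bulk store: in a real hybrid setup this would be an HDF5
# dataset; a plain list stands in for it here so the sketch needs only
# the standard library.
bulk_values = [0.5, 1.5, 2.5, 3.5, 4.5]

# Hypothetical records: (chromosome, position, row number in bulk_values).
records = [
    ("chr1", 100, 0),
    ("chr1", 250, 1),
    ("chr1", 900, 2),
    ("chr2", 120, 3),
    ("chr2", 300, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loc (chrom TEXT, pos INTEGER, data_row INTEGER)")
conn.executemany("INSERT INTO loc VALUES (?, ?, ?)", records)
# Composite B-tree index so that lookups by chromosome and position
# range do not have to scan the whole table.
conn.execute("CREATE INDEX loc_chrom_pos ON loc (chrom, pos)")


def rows_in_region(connection, chrom, start, end):
    """Return bulk-store row numbers for chrom with start <= pos <= end."""
    cursor = connection.execute(
        "SELECT data_row FROM loc "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY data_row",
        (chrom, start, end),
    )
    return [r[0] for r in cursor]


# SQLite answers "which rows?"; the bulk store then serves only those rows.
hits = rows_in_region(conn, "chr1", 200, 1000)
subset = [bulk_values[i] for i in hits]
print(hits, subset)
```

The point of the split is that the relational side stays small (keys and
row numbers only), while the array side is read by row number and never
scanned.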

> Also, echoing Peter, the BioHDF download warns you that the code is
> not stable, tested, or supported:

Not tested is not good, but that's mostly a matter of having unit tests.

Also, I am referring to using HDF5 (mature, tested), not necessarily
BioHDF as a higher layer (with which I have no experience at all).
Should BioHDF not have tests and release cycles, it will probably not be 
the answer for me either.

Along those lines, a very recent post advertising a position at FHCRC
(the Bioconductor group) suggests that HDF5 (and netCDF) are directions
being considered over there as well.

> http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html
>
> BAM is widely used and has tools that are meant to work on it
> in production environments now, while HDF tool support still feels
> experimental.

I had that feeling with BAM/SAM tools at the time, and I knew my way
around HDF5 a bit.

> Sometimes it is best to be practical and keep an eye
> on other technical solutions as they evolve,

Besides, I am reading that not everyone using BAM/SAM is happy with it
(and some are threatening to fork).
I might well be wrong, but I don't think that BAM/SAM is (yet) so
prominent that efforts should first go into converting to it.

> Brad