[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
Laurent
lgautier at gmail.com
Sat Jun 5 10:07:00 EDT 2010
On 05/06/10 15:02, biopython-request at lists.open-bio.org wrote:
> Laurent and Peter;
>
>> I do believe that building on HDF5 is a better approach:
>> - better use of resources (do not reinvent completely what is
>> already existing unless better)
>> - HDF5 is designed as a rather general storage architecture, and
>> will let one build tailored solutions when needed.
>
> HDF5 does have lots of good technical points, although as Peter mentions
> the lack of community uptake is a concern. To potentially explain this,
> here is my personal HDF5 usage story: I took an in depth look at PyTables
> for some large data sets that were overwhelming SQLite:
>
> http://www.pytables.org/moin
>
> The data loaded quickly without any issues, but the most basic thing
> I needed was indexes to retrieve a subset of the data by chromosome
> and position. Unfortunately, you can't create indexes without
> buying the Pro edition:
>
> http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
>
> That immediately killed my ability to share the script so I ended
> my HDF5 experiment and reworked my SQLite approach.
PyTables is already a dialect of HDF5 (not necessarily readable by other
HDF5 software/libraries), and the "Pro Edition" adds indexing
capabilities, I think.
h5py is the alternative.
Also, indexing (as in "hash function + tree") can be done using SQLite,
and both (HDF5 and SQLite) can complement each other very efficiently.
[I have designed and implemented ad-hoc hybrid solutions on several
occasions, and have not regretted it so far]
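To make the hybrid idea concrete, here is a minimal sketch, not code from any of my projects. SQLite holds only the lookup keys (chromosome, position) plus a row offset, and the bulk values live in a separate store. All names (`locus`, `row_offset`, `fetch`) and the sample records are made up for illustration, and a stdlib `array` stands in for the HDF5 dataset so that the example runs without extra dependencies. It assumes the records are stored sorted by chromosome and position, so that a query maps to one contiguous slice of the bulk store.

```python
import sqlite3
from array import array

# Hypothetical records: (chromosome, position, value),
# stored sorted by chromosome and then position.
records = [
    ("chr1", 100, 0.5),
    ("chr1", 250, 1.5),
    ("chr1", 900, 2.5),
    ("chr2", 50, 3.5),
    ("chr2", 400, 4.5),
]

# Bulk values. A stdlib array stands in for an HDF5 dataset here.
values = array("d", (v for _, _, v in records))

# SQLite stores only the keys and the row offset into the bulk store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locus (chrom TEXT, pos INTEGER, row_offset INTEGER)"
)
conn.executemany(
    "INSERT INTO locus VALUES (?, ?, ?)",
    ((c, p, i) for i, (c, p, _) in enumerate(records)),
)
# The index that makes lookups by chromosome and position cheap.
conn.execute("CREATE INDEX locus_chrom_pos ON locus (chrom, pos)")


def fetch(chrom, start, end):
    """Return bulk values for positions start..end (inclusive) on chrom."""
    lo, hi = conn.execute(
        "SELECT MIN(row_offset), MAX(row_offset) FROM locus "
        "WHERE chrom = ? AND pos BETWEEN ? AND ?",
        (chrom, start, end),
    ).fetchone()
    if lo is None:  # no matching rows
        return []
    # Sorted storage means the matches form one contiguous slice.
    return list(values[lo : hi + 1])
```

With an actual HDF5 file, the same `lo`/`hi` pair would be used to slice the dataset instead of the in-memory array.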
> Also, echoing Peter, the BioHDF download warns you that the code is
> not stable, tested, or supported:
Not tested is not good, but that's mostly a matter of having unit tests.
Also I am referring to using HDF5 (mature, tested), not necessarily
BioHDF as a higher layer (with which I have no experience at all).
Should BioHDF not have tests and release cycles, it will probably not be
the answer for me either.
Along those lines, a very recent post advertising a position at FHCRC
(Bioconductor's group) suggests that HDF5 (and netCDF) are directions
being considered over there as well.
> http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html
>
> BAM is widely used and has tools that are meant to work on it
> in production environments now, while HDF tool support still feels
> experimental.
I had that feeling with BAM/SAM tools at the time, and I knew my way
around HDF5 a bit.
> Sometimes it is best to be practical and keep an eye
> on other technical solutions as they evolve,
On the other hand, I am reading that not everyone using BAM/SAM is happy
with it (and some are threatening to fork).
I might well be wrong, but I don't think that BAM/SAM has (yet) a place
so prominent that efforts should first go into converting to it.
> Brad