[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Chris Fields cjfields at illinois.edu
Sat Jun 5 09:31:37 EDT 2010


On Jun 5, 2010, at 7:51 AM, Brad Chapman wrote:

> Laurent and Peter;
> 
>> I do believe that building on HDF5 is a better approach:
>> - better use of resources (do not reinvent completely what is
>> already existing unless better)
>> - HDF5 is designed as a rather general storage architecture, and
>> will let one build tailored solutions when needed.
> 
> HDF5 does have lots of good technical points, although as Peter mentions
> the lack of community uptake is a concern. To potentially explain this,
> here is my personal HDF5 usage story: I took an in-depth look at PyTables
> for some large data sets that were overwhelming SQLite:
> 
> http://www.pytables.org/moin
> 
> The data loaded quickly without any issues, but the most basic thing
> I needed was indexes to retrieve a subset of the data by chromosome
> and position. Unfortunately, you can't create indexes without
> buying the Pro edition:
> 
> http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
> 
> That immediately killed my ability to share the script so I ended 
> my HDF5 experiment and reworked my SQLite approach.
> 
> Also, echoing Peter, the BioHDF download warns you that the code is
> not stable, tested, or supported:
> 
> http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html
> 
> BAM is widely used and has tools that are meant to work on it 
> in production environments now, while HDF tool support still feels
> experimental. Sometimes it is best to be practical and keep an eye
> on other technical solutions as they evolve.
> 
> Brad

Yes, it will be interesting to see how far along it is at BOSC.

chris



