[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?

Peter biopython at maubp.freeserve.co.uk
Sat Jun 5 12:25:52 UTC 2010


On Sat, Jun 5, 2010 at 1:06 PM, Laurent <lgautier at gmail.com> wrote:
>>
>> There is some talk on the samtools mailing list about general improvements
>> to the chunking in BAM, relocating the header information (and other very
>> read specific things about representing error models, indels, etc). You
>> may be right that HDF5 has technical advantages over BAM version 1,
>> but currently my impression is that SAM/BAM is making good headway
>> with becoming a de facto standard for next generation data, while HDF5 is
>> not. Maybe someone should suggest they move to HDF5 internally for BAM
>> version 2?
>
> De facto standards become so because more people use them at some
> point (which may involve a step during which a lot of people /believe/
> that most people are using one format over another ;-) ), but this
> does not necessarily make them the best technical solutions.

Absolutely.

> I do believe that building on HDF5 is a better approach:
> - better use of resources (do not completely reinvent what already
> exists unless the result is better)
> - HDF5 is designed as a rather general storage architecture, and will let
> one build tailored solutions when needed.
>
> I'd be surprised if the BAM/SAM developers did not know about the HDF
> formats, but I do not know for sure. Is there any BAM/SAM person reading?

I've been subscribed to the samtools mailing list for a few weeks now. I think
we (or better yet the BioHDF team) should put this idea forward on their
mailing list. As I said, they appear to be discussing some fairly dramatic
changes to the internals of the BAM format (while intending to keep their
API as close as possible), so now would be a good time to consider a
switch from their blocked gzip system to something else like HDF instead.

Chris has pointed out some BioHDF people will be at BOSC 2010. There
is also a "HiTSeq: High Throughput Sequencing" ISMB 2010 SIG meeting
at the same time as BOSC 2010, so there could be some SAM/BAM
folk about in Boston to have some in-person discussions with. Will you
be there this year, Laurent (or at EuroSciPy or something else instead)?

Regards,

Peter


