[Bioperl-l] fastq splitter

Thu Mar 1 17:38:08 UTC 2012

On Mar 1, 2012, at 4:54 AM, Ewan Birney wrote:

>>>> 
>>>> Unaligned BAM makes the most sense.  I've also been talking with the
>>>> HDF5 folks here sporadically, they're still keen on promoting BioHDF
>>>> (it is pretty fast), though that has cooled considerably.
>>>> 
>>>> Anyone working directly with CRAM in their pipelines?
>>>> 
>>>> chris
>>> 
>>> I understand that Sanger are looking at moving their pipelines from BAM to
>>> CRAM later this year, but CRAM is still quite new and in flux.
>>> 
>>> Peter
>> 
>> Yeah, I wasn't sure how the community outside of Sanger is approaching this.  
>> 
> 
> A number of people are looking at this in different contexts. With the forthcoming
> 0.7 release, where arbitrary tags are stored (in compressed form), and the optimised
> lossless compression already distributed in 0.6, the first adoption of CRAM
> (basically a compressed BAM with no major loss of information - read names have
> to go, but that's it) should be smoother for people.
> 
> 
> In our hands this gives a 2-3 fold compression over BAMs (depending on what you
> have in the BAMs), with a lot of this now depending on how many tags there are and
> how well those tags compress. The important thing though is that CRAMs with fewer
> or no tags compress a further 2 or 3 fold beyond this (a total of 5 to 10 fold on
> current BAMs) but are still lossless on bases plus quality. In the future - with
> lossy behaviour on qualities - this can go as low as 0.2 bits/base (bits/base,
> meaning the number of bits needed to store a base, including the quality model
> used, is our preferred way of thinking about this).
> 
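[Illustrative note: the bits/base measure described above is just the file size in bits divided by the number of bases stored. A minimal Python sketch of that arithmetic follows. The function names and the file sizes are hypothetical, chosen only to make the numbers easy to follow; they are not measurements from the CRAM toolkit or from any real BAM file.]

```python
def bits_per_base(file_size_bytes: int, total_bases: int) -> float:
    """Average storage cost per base in bits (qualities and tags are part of the file size)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return file_size_bytes * 8 / total_bases


def fold_compression(original_bytes: int, compressed_bytes: int) -> float:
    """How many times smaller the compressed file is than the original."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return original_bytes / compressed_bytes


# Hypothetical example values, not real measurements.
bam_bytes = 3_000_000_000      # assumed BAM size
cram_bytes = 1_200_000_000     # assumed CRAM size for the same reads
total_bases = 4_000_000_000    # assumed number of sequenced bases

print(bits_per_base(bam_bytes, total_bases))    # 6.0
print(bits_per_base(cram_bytes, total_bases))   # 2.4
print(fold_compression(bam_bytes, cram_bytes))  # 2.5
```

[Expressing sizes as bits/base makes files of different read counts and read lengths comparable, which a raw fold figure against a particular BAM does not.]
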
> 
> Check out the GR paper from last year:
> 
> http://ukpmc.ac.uk/articles/PMC3083090
> 
> 
> 
> And check out three blog posts on this:
> 
> http://genomeinformatician.blogspot.com/2011/05/compressing-dna-part-1.html
> 
> http://genomeinformatician.blogspot.com/2011/05/engineering-around-reference-based.html
> 
> http://genomeinformatician.blogspot.com/2011/05/compressing-dna-future-plan.html
> 
> 
> 
> And note the CRAM development list:
> 
> http://www.ebi.ac.uk/ena/about/cram_toolkit

Yep, already on it.  :)

Thanks for the blog pointer Ewan, didn't see those before.  We've been discussing options for storing data locally and may be centering on CRAM, though locally we have the HDF5 group as well (as I mentioned before), who have been promoting BioHDF for a bit now.  Not sure of the status on that as of yet.

chris