[Bioperl-l] fastq splitter
birney at ebi.ac.uk
Thu Mar 1 05:54:56 EST 2012
>>> Unaligned BAM makes the most sense. I've also been talking with the
>>> HDF5 folks here sporadically, they're still keen on promoting BioHDF
>>> (it is pretty fast), though that has cooled considerably.
>>> Anyone working directly with CRAM in their pipelines?
>> I understand that Sanger are looking at moving their pipelines from BAM to
>> CRAM later this year, but CRAM is still quite new and in flux.
> Yeah, I wasn't sure how the community outside of Sanger is approaching this.
A number of people are looking at this in different contexts. The forthcoming
0.7 release, where arbitrary tags are stored (in compressed form), together with
the optimised lossless compression already distributed in 0.6, should make a
first adoption of CRAM smoother (CRAM being basically a compressed BAM with no
major loss of information - read names have to go, but that's it).
In our hands this gives a 2-3 fold compression over BAMs (depending on what you
have in the BAMs), with a lot of the variation now coming down to how many tags
there are and how well those tags compress. The important thing though is that
CRAMs with fewer or no tags compress a further 2 or 3 fold beyond this (a total
of 5 to 10 fold on current BAMs) while still being lossless on bases plus
quality. In the future - with lossy behaviour on qualities - this can go as low
as 0.2 bits/base (bits/base, meaning the number of bits needed to store a base
including the quality model used, is our preferred way of thinking about this).
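
For anyone who wants to put their own files on the same scale, here is a
minimal sketch of the bits/base arithmetic. The function names and the example
numbers are made up for illustration; they are not part of any CRAM tool, and
the file sizes are not measurements.

```python
def bits_per_base(file_size_bytes: int, total_bases: int) -> float:
    """Average number of bits spent per stored base, qualities included."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return file_size_bytes * 8 / total_bases


def fold_compression(original_bytes: int, compressed_bytes: int) -> float:
    """How many times smaller the compressed file is than the original."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return original_bytes / compressed_bytes


if __name__ == "__main__":
    # Hypothetical example: 1e9 bases stored in a 25 MB file.
    print(bits_per_base(25_000_000, 1_000_000_000))  # 0.2
    # Hypothetical example: a 10 GB BAM reduced to a 2 GB CRAM.
    print(fold_compression(10_000_000_000, 2_000_000_000))  # 5.0
```

The total number of bases has to come from the reads themselves (sum of read
lengths), not from the reference length.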
Check out the GR paper from last year:
And check out three blog posts on this:
And note the CRAM development list: