[Open-bio-l] OBDA redux? Compressed files

Sun Nov 13 12:30:37 UTC 2011

Hi again,

I've retitled this as it is a little off topic from the main OBDA redux thread,
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000819.html
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000820.html
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000821.html

As far as I recall, the original flat file and BDB based OBDA
specification for indexing sequencing files didn't cover
compressed files. That might be something to consider
(although we should sort out uncompressed text/binary
files first).

I've recently been experimenting with using compressed
files - in particular simple GZIP files (ignoring any block structure)
and BGZF (the specialised gzipped blocking used in BAM), see:

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
http://seqanswers.com/forums/showthread.php?t=15347

The virtual offset approach used in BGZF squeezes a 16 bit
within block offset (thus limiting you to 64KB blocks) and a
48 bit block start offset (thus limiting you to a 256TB file) into
a single 64 bit "virtual" offset. That makes sense if you are
keeping a lookup table of many offsets in memory, and it
can be used as is with code expecting a single offset (like
the current Biopython SQLite index schema).
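
As a rough illustration of the bit packing described above, here
is a minimal Python sketch. The function names are my own for this
example and are not taken from any existing library:

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and a within block offset into one 64 bit integer.

    block_start is the offset of the BGZF block in the compressed file
    (48 bits), within_block is the offset into that block's decompressed
    data (16 bits).
    """
    if not 0 <= within_block < 2**16:
        raise ValueError("within block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Reverse of make_virtual_offset, returns (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(100000, 10)
print(voffset, split_virtual_offset(voffset))
```

Because the result is just an integer, it can be stored anywhere a
plain file offset would be stored.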

I've also looked at bzip2, which is block based as well, with
the block size ranging from 100KB to 900KB.

http://bzip.org/
http://bzip.org/1.0.5/bzip2-manual-1.0.5.html

I haven't tried any performance tests yet, which would
be interesting as I believe compression/decompression
of bzip2 is more costly in CPU terms than gzip (although
both will be block size dependent).

If we wanted to imitate the BGZF virtual offset scheme for
arbitrary BZIP2 files, an alternative 64 bit virtual offset scheme
could use 20 bits to cover bz2 blocks of up to 900KB, leaving
64 - 20 = 44 bits for the start offset, thus limiting you to just
2^44 bytes or 16TB, which sounds OK only in the medium term.
On the bright side this could be used to index any BZIP2 file
(under 16TB), whereas the BGZF scheme cannot be applied to
an arbitrary GZIP file.
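
To make the arithmetic concrete, here is a minimal Python sketch of
such a 44 + 20 bit split. This is purely hypothetical: the names
are invented for this example, and it assumes the block start can
be recorded as a simple integer offset into the compressed file:

```python
WITHIN_BITS = 20            # 2**20 = 1048576, enough for 900KB blocks
START_BITS = 64 - WITHIN_BITS   # 44 bits left for the block start


def make_bz2_virtual_offset(block_start, within_block):
    """Hypothetical 64 bit virtual offset for bzip2 (44 + 20 bit split)."""
    if not 0 <= within_block < 2**WITHIN_BITS:
        raise ValueError("within block offset must fit in 20 bits")
    if not 0 <= block_start < 2**START_BITS:
        raise ValueError("block start offset must fit in 44 bits")
    return (block_start << WITHIN_BITS) | within_block


def split_bz2_virtual_offset(virtual_offset):
    """Returns (block_start, within_block) for the hypothetical scheme."""
    mask = 2**WITHIN_BITS - 1
    return virtual_offset >> WITHIN_BITS, virtual_offset & mask


# The largest addressable block start, in units of 2**40 bytes
print(2**START_BITS // 2**40)
```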

On the other hand, storing the block start and within block
offsets separately is truly generic and could be used on any
blocked GZIP file (including BGZF), BZIP2, etc. It would make
the SQLite schema a bit more complicated though.
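
For example, a two column version might look something like this
minimal Python sketch. The table and column names here are made up
for illustration and are not the current Biopython schema:

```python
import sqlite3

# In-memory database, just for demonstration
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE offset_data ("
    "key TEXT PRIMARY KEY, "
    "block_start INTEGER, "   # where the compressed block begins in the file
    "within_block INTEGER)"   # offset into that block once decompressed
)
con.executemany(
    "INSERT INTO offset_data VALUES (?, ?, ?)",
    [("seq1", 0, 0), ("seq2", 0, 1250), ("seq3", 18432, 77)],
)
con.commit()


def lookup(con, key):
    """Return (block_start, within_block) for a record identifier."""
    row = con.execute(
        "SELECT block_start, within_block FROM offset_data WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row


print(lookup(con, "seq3"))
```

Neither column is limited to 16 or 20 bits here, so the same table
would work whatever the block size of the compression format.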

Maybe something to consider for the next revision to OBDA,
and focus on the non-compressed case for now?

Regards,

Peter