[Biojava-l] New Wiki page
Matthew Pocock
mrp@sanger.ac.uk
Thu, 08 Feb 2001 11:34:38 +0000
Hi Paul,
A bit-compressed symbol-list implementation would be a good thing for
whole-chromosome analysis. There is an interface called AlphabetIndex in
the symbol package that maps an alphabet to/from integers. It should be
fairly easy to write a SymbolList implementation that uses one of these,
some bit-shifts and a byte-array to work out a relatively efficient
storage mechanism for alphabets with <= 8 symbols (e.g. DNA & RNA).
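As a rough illustration of the packing idea (not BioJava code), here is a minimal sketch that stores one symbol per nibble, two per byte. The class name PackedSymbolStore is hypothetical, and the "acgt" string is only a stand-in for whatever alphabet-to-integer mapping an AlphabetIndex would provide. Positions are 1-based, on the assumption that this matches SymbolList.

```java
/**
 * Hypothetical sketch: packs symbol codes (0..15) into 4 bits each,
 * so two symbols share one byte. Not part of BioJava.
 */
final class PackedSymbolStore {
    // Stand-in for an alphabet <-> integer index (assumption for this sketch).
    private static final String DNA = "acgt";

    private final byte[] data;
    private final int length;

    PackedSymbolStore(String sequence) {
        length = sequence.length();
        data = new byte[(length + 1) / 2]; // two symbols per byte, rounded up
        for (int i = 0; i < length; i++) {
            int code = DNA.indexOf(Character.toLowerCase(sequence.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("Unknown symbol at " + (i + 1));
            }
            int shift = (i & 1) * 4;               // low nibble for even i, high for odd
            data[i >> 1] |= (byte) (code << shift);
        }
    }

    /** Returns the symbol at a 1-based position. */
    char symbolAt(int pos) {
        if (pos < 1 || pos > length) {
            throw new IndexOutOfBoundsException("Position out of range: " + pos);
        }
        int i = pos - 1;
        int shift = (i & 1) * 4;
        int code = (data[i >> 1] >> shift) & 0x0F; // mask off the other nibble
        return DNA.charAt(code);
    }

    int length() { return length; }

    /** Number of bytes actually used for storage. */
    int bytesUsed() { return data.length; }
}
```

At a nibble per base, 35 Megabases would need about 17.5 Mbytes for the packed array, compared with the 4 bytes/base figure quoted below. Two bits per base would halve that again for a plain four-letter alphabet, at the cost of no room for ambiguity codes.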
We would need to benchmark this - pointers are cheap, but the page
swapping is expensive. Bit-arithmetic potentially costs more CPU, but you
can fit more sequence into one chunk of memory.
On the other hand, we get away with running analysis programs over
chromosome 1 (and 22 trivially) by loading chunks of sequence on demand
(behind a SymbolList implementation) - Thomas is the one to bug about
this. Chunking byte-compressed sequences may be the optimal solution
though...
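To make the chunk-on-demand idea concrete, here is a minimal sketch, and only a sketch: it is not the implementation referred to above. The class name ChunkedSequence, the loader callback and the small least-recently-used cache are all assumptions made for illustration. Each chunk is a byte array, so it could just as well hold packed symbols.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch: fetches fixed-size chunks of a sequence on demand
 * and keeps only the most recently used ones in memory.
 */
final class ChunkedSequence {
    private final long length;
    private final int chunkSize;
    private final IntFunction<byte[]> loader; // loads chunk number n (0-based)
    private final Map<Integer, byte[]> cache;
    int loads = 0;                            // counts loader calls, for inspection

    ChunkedSequence(long length, int chunkSize, final int maxChunks,
                    IntFunction<byte[]> loader) {
        this.length = length;
        this.chunkSize = chunkSize;
        this.loader = loader;
        // An access-ordered LinkedHashMap evicts the least recently used chunk.
        this.cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > maxChunks;
            }
        };
    }

    /** Returns the stored byte at a 1-based position, loading its chunk if needed. */
    byte symbolAt(long pos) {
        if (pos < 1 || pos > length) {
            throw new IndexOutOfBoundsException("Position out of range: " + pos);
        }
        long i = pos - 1;
        int chunk = (int) (i / chunkSize);
        byte[] block = cache.get(chunk);
        if (block == null) {
            block = loader.apply(chunk);
            loads++;
            cache.put(chunk, block);
        }
        return block[(int) (i % chunkSize)];
    }
}
```

A scan that moves along the sequence touches each chunk once, so memory use is bounded by maxChunks * chunkSize regardless of chromosome length. Random access across the whole sequence would reload chunks repeatedly, which is one of the things a benchmark would need to measure.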
Matthew
Paul Edlefsen wrote:
> Speaking of who's doing what, I was considering writing an implementation of
> SymbolList that takes a nibble (or maybe a byte) per DNA base instead of a
> word. I've got this code in C++ and thought I'd port it over, though I
> haven't yet begun.
>
> Is anybody else working along similar lines? I need to read in multimegabase
> sequences and just 35Megabase Human chr.22 is too much for the current
> implementation, even increasing the heap to 128Megs. (This makes sense: 35 M
> bases * 4 bytes/base > 128 M bytes).
>
> Our goal is to make some open tools for whole-genome analysis and
> cross-species comparison. 35 Megabases is just the tip of the iceberg: to
> defend biojava to my peers I need to demonstrate that it can handle big
> sequences.
>
> :Paul