[Biojava-dev] AlphabetManager.createSymbol(...)
David Huen
smh1008 at cam.ac.uk
Thu Feb 15 12:16:49 UTC 2007
On Feb 15 2007, mark.schreiber at novartis.com wrote:
>
>A similar suggestion has been made in the past for indexing SymbolLists in
>terms of BigInteger. How practical would such a large alphabet be? Eg
>unless you expect it to be pretty sparse in terms of the number of
>possible symbols that are actually seen you might get major problems with
>memory.
>
I think it is practical. Even a simple (AA)^10 alphabet exceeds the range
of int, but an alignment of 10 proteins may be only, say, 1000 residues
long, so at most 1000 symbols will ever be instantiated, and far fewer need
to remain instantiated throughout the run. I see less point for
SymbolLists, since it seems unlikely that any chromosome could have more
than an int's worth of bases.
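
The size argument can be checked with a few lines of plain Java. This is
only a sketch of the arithmetic: the class and method names are made up for
illustration, it uses no BioJava classes, and it assumes a 20-letter amino
acid alphabet.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of BioJava.
class CrossProductSize {
    // Number of symbols in the cross product of `columns` copies of an
    // alphabet that has `alphabetSize` symbols.
    static BigInteger crossProductSize(int alphabetSize, int columns) {
        return BigInteger.valueOf(alphabetSize).pow(columns);
    }

    public static void main(String[] args) {
        // Assumption: 20 amino acids, 10 aligned proteins.
        BigInteger size = crossProductSize(20, 10);
        BigInteger intMax = BigInteger.valueOf(Integer.MAX_VALUE);
        System.out.println("(AA)^10 size:      " + size);
        System.out.println("Integer.MAX_VALUE: " + intMax);
        System.out.println("exceeds int:       " + (size.compareTo(intMax) > 0));
    }
}
```

With these assumptions the alphabet has 20^10 = 10,240,000,000,000 possible
symbols, well beyond Integer.MAX_VALUE (2,147,483,647), even though only a
handful are ever seen in one alignment.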
The main reason I need these huge alphabets is for 1-D HMMs that run over
genome alignments. I also hope to represent symbols in these alphabets
internally by the BigInteger value of their alphabet index.
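
One way such an index could be formed is to treat each alignment column as
a digit in a base-N number, where N is the size of the component alphabet.
The sketch below is an illustration only: the names are hypothetical, it
assumes every column uses the same component alphabet, and it is not the
encoding BioJava actually uses.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of BioJava.
class CrossProductIndex {
    // Pack per-column symbol indices (each in [0, radix)) into a single
    // BigInteger, with the first column as the most significant digit.
    static BigInteger encode(int[] columnIndices, int radix) {
        BigInteger r = BigInteger.valueOf(radix);
        BigInteger index = BigInteger.ZERO;
        for (int c : columnIndices) {
            if (c < 0 || c >= radix) {
                throw new IllegalArgumentException("index out of range: " + c);
            }
            index = index.multiply(r).add(BigInteger.valueOf(c));
        }
        return index;
    }

    // Recover the per-column indices from a packed index.
    static int[] decode(BigInteger index, int radix, int columns) {
        BigInteger r = BigInteger.valueOf(radix);
        int[] out = new int[columns];
        for (int i = columns - 1; i >= 0; i--) {
            BigInteger[] quotientAndRemainder = index.divideAndRemainder(r);
            out[i] = quotientAndRemainder[1].intValue();
            index = quotientAndRemainder[0];
        }
        return out;
    }
}
```

Only the indices of symbols that actually occur need to be computed, so the
size of the full alphabet never has to be enumerated.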
Incidentally, the SparseCrossProductAlphabet appeared to be caching every
symbol it was ever asked for, so I have changed it to use a
WeakValueHashMap internally.
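
The idea behind a weak-valued cache can be sketched with the standard
java.lang.ref classes. This is a generic illustration with made-up names,
not the WeakValueHashMap implementation referred to above: values are held
through WeakReferences, so a symbol that nothing else references can be
garbage collected and its entry dropped.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a cache whose values are only weakly held.
class WeakValueCache<K, V> {
    // A weak reference that remembers its key so the entry can be removed
    // once the value has been collected.
    private static final class Ref<K, V> extends WeakReference<V> {
        final K key;
        Ref(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Ref<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    // Returns the cached value, or null if absent or already collected.
    public V get(K key) {
        purge();
        Ref<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    public void put(K key, V value) {
        purge();
        map.put(key, new Ref<>(key, value, queue));
    }

    public int size() {
        purge();
        return map.size();
    }

    // Drop entries whose values the garbage collector has cleared.
    @SuppressWarnings("unchecked")
    private void purge() {
        Ref<K, V> ref;
        while ((ref = (Ref<K, V>) queue.poll()) != null) {
            // Remove only if the key still maps to this exact reference.
            map.remove(ref.key, ref);
        }
    }
}
```

A symbol stays cached for as long as some SymbolList or other object holds
a strong reference to it, and can be recreated on demand afterwards.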
Regards,
David
--
David Huen
Dept of Genetics
University of Cambridge
CB2 3EH
U.K.