[Biojava-l] Serialization fallout from the Grand Symbol Change
Thomas Down
td2@sanger.ac.uk
Tue, 19 Dec 2000 14:06:59 +0000
Hi.
A couple of weeks on from the Grand Symbol Changes, everything
seemed to be going smoothly until...
Serialization of arbitrary symbols (be they AtomicSymbols,
BasisSymbols, or something else).
One of the important characteristics we've always tried to
preserve in BioJava is that the symbols contained within
FiniteAlphabets should always be singletons. This means:
- There is always exactly ONE Symbol object within
your java virtual machine representing, say, the
DNA `T' symbol.
- You can always compare symbols from FiniteAlphabets
using object identity (== operator), rather than
using the equals method.
Back in the dark ages, I was able to implement a system
whereby simple, atomic symbols (the only kind we had then)
like the DNA `T' could be serialized and deserialized
while preserving this object identity property (see
AlphabetManager.WellKnownSymbol if you're interested --
the idea is that instead of serializing the Symbol object
directly, we serialize a special `place holder' object. Upon
deserialization, this replaces itself with the corresponding
canonical symbol). For a while everything worked nicely...
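For anyone who hasn't looked at that code, here is a minimal sketch of the place-holder trick using the standard writeReplace/readResolve hooks of Java serialization. This is not the actual AlphabetManager.WellKnownSymbol code: DemoSymbol, PlaceHolder and roundTrip are hypothetical names made up for illustration, and the symbol pool is just a static map.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a singleton atomic symbol (not the BioJava class).
final class DemoSymbol implements Serializable {
    // Canonical instances, keyed by name. Declared first so that it is
    // initialized before the symbol constants below.
    private static final Map<String, DemoSymbol> CANONICAL = new HashMap<>();

    static final DemoSymbol A = register("adenine");
    static final DemoSymbol T = register("thymine");

    private final String name;

    private DemoSymbol(String name) {
        this.name = name;
    }

    private static DemoSymbol register(String name) {
        DemoSymbol s = new DemoSymbol(name);
        CANONICAL.put(name, s);
        return s;
    }

    String getName() {
        return name;
    }

    // Called by Java serialization: write a place holder, not this object.
    private Object writeReplace() throws ObjectStreamException {
        return new PlaceHolder(name);
    }

    private static final class PlaceHolder implements Serializable {
        private final String name;

        PlaceHolder(String name) {
            this.name = name;
        }

        // Called after deserialization: swap in the canonical symbol.
        private Object readResolve() throws ObjectStreamException {
            DemoSymbol s = CANONICAL.get(name);
            if (s == null) {
                throw new InvalidObjectException("Unknown symbol: " + name);
            }
            return s;
        }
    }

    // Serialize to a byte array and read the object back, for demonstration.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this in place, DemoSymbol.roundTrip(DemoSymbol.T) hands back the very same object as DemoSymbol.T, so the == comparison keeps working after deserialization.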
Having more complex symbol objects makes matters a /lot/
harder, though. Just as there is a single `T' symbol
in the DNA alphabet, there should be a singleton (T T)
symbol in the alphabet (DNA x DNA). And so on. We're
looking for a new way of canonicalizing arbitrary symbols.
Possible options are:
- Keep a pool of all known symbols in the AlphabetManager,
and use that for canonicalization. We've done this in
the past but it seems a /bad idea/ -- especially if
people start working with very complicated cross-product
alphabets. If the alphabet/symbol system is to scale,
the pools of canonical symbols need to be associated with
their containing alphabets, so they can be garbage-collected
once the alphabet is no longer in use.
Options which keep the canonical symbol pool in the Alphabet:
- Symbol objects keep a reference to their `primary' containing
alphabet. This makes serialization/deserialization relatively
easy. There has been resistance to this plan in the past,
though -- does anyone have any trouble with it now? From
where I'm standing this looks like the simplest solution which
is scalable and doesn't break anything too radically. But
it /does/ change the idea of what a symbol is slightly...
- The standard symbol implementations are no longer Serializable.
Objects which use Symbols (e.g. SymbolLists, Distributions)
have to provide explicit serialization code. In support of
this, we add a method to AlphabetManager to construct a
`place holder' object which encapsulates all the information
necessary to reconstitute a given symbol (including the
Alphabet to use for canonicalization purposes).
- Something else I've missed?
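To make the `symbols keep a reference to their primary alphabet' option a bit more concrete, here is a rough sketch of how it might look. All of the names (DemoAlphabet, PooledSymbol, Ref, forName, getSymbol) are hypothetical and not part of BioJava. The sketch also assumes that an alphabet can be found again by name at deserialization time, and for brevity it uses a static registry that holds alphabets strongly. A real version would need some weaker lookup, otherwise the pools could never be garbage-collected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical alphabet that owns the pool of its own canonical symbols.
final class DemoAlphabet {
    // Assumption: alphabets can be looked up by name when deserializing.
    // This registry holds them strongly, which a real design would avoid.
    private static final Map<String, DemoAlphabet> BY_NAME = new ConcurrentHashMap<>();

    private final String name;
    private final Map<String, PooledSymbol> pool = new ConcurrentHashMap<>();

    private DemoAlphabet(String name) {
        this.name = name;
    }

    static DemoAlphabet forName(String name) {
        return BY_NAME.computeIfAbsent(name, DemoAlphabet::new);
    }

    String getName() {
        return name;
    }

    // Returns the one canonical symbol for this token, creating it on first use.
    PooledSymbol getSymbol(String token) {
        return pool.computeIfAbsent(token, t -> new PooledSymbol(this, t));
    }
}

// Hypothetical symbol that remembers its primary containing alphabet.
final class PooledSymbol implements Serializable {
    private final transient DemoAlphabet alphabet;
    private final String token;

    PooledSymbol(DemoAlphabet alphabet, String token) {
        this.alphabet = alphabet;
        this.token = token;
    }

    DemoAlphabet getAlphabet() {
        return alphabet;
    }

    // Serialize only (alphabet name, token), never the symbol itself.
    private Object writeReplace() throws ObjectStreamException {
        return new Ref(alphabet.getName(), token);
    }

    private static final class Ref implements Serializable {
        private final String alphabetName;
        private final String token;

        Ref(String alphabetName, String token) {
            this.alphabetName = alphabetName;
            this.token = token;
        }

        // Ask the owning alphabet for its canonical symbol.
        private Object readResolve() throws ObjectStreamException {
            return DemoAlphabet.forName(alphabetName).getSymbol(token);
        }
    }

    // Serialize to a byte array and read the object back, for demonstration.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

So a (T T) symbol obtained from DemoAlphabet.forName("(DNA x DNA)") survives a round trip as the same object, and the pool lives in the alphabet rather than in one global manager.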
Any more thoughts on this? It's a tough one...
Thomas.
--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
-- Terry Pratchett