[Biojava-dev] Re: AlphabetManager problem?

Thomas Down td2 at sanger.ac.uk
Tue Oct 7 05:33:01 EDT 2003


On Tue, Oct 07, 2003 at 02:46:44PM +1300, Schreiber, Mark wrote:
> OK -
>  
> The serialization problem is caused by a readResolve method using a call to AlphabetManager.symbolForName(String name) which works fine for DNA and RNA but apparently barfs on Protein?
>  
> I think the problem is caused in the AlphabetManager.xml file where the DNA/RNA Symbols are declared outside of the <alphabet> tag but they are declared inside the <alphabet> tag for the protein alphabets. I'm not brave enough to mess with that file myself (it may not even be the cause of the problem).


This has been a persistent problem.  AlphabetManager changed
radically several times in the 1.3 timeframe to try and sort
this out, but we never quite hit a solution that worked all the
time.  My personal favourite (and the one which came closest
to working) was to keep *all* well-known symbols scoped by alphabet.
This got serialization working perfectly, but was shot down on
the basis that a symbol in the `pure' PROTEIN alphabet was not
identical to the corresponding symbol in the PROTEIN+TERMINATION
alphabet.  Arguably, the problem here is the existence of PROTEIN-TERM,
but it's probably too late to change that...

How about we give all well known alphabets and symbols totally
unambiguous textual identifiers (perhaps LSIDs scoped in the open-bio.org
namespace) which are specified explicitly in AlphabetManager.xml.  Then
we just need one LSID->alphabet and one LSID->symbol map in
AlphabetManager, and resolve everything through that.
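
To make the two-map idea concrete, here is a rough sketch of what the lookup
side might look like.  Nothing here is existing BioJava API: LsidRegistry and
its method names are made up for illustration, the type parameters A and S
stand in for Alphabet and Symbol, and the code uses generics for brevity.  It
only assumes that every well-known object carries exactly one LSID string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

/**
 * Hypothetical sketch, not BioJava API: one LSID->alphabet map and one
 * LSID->symbol map. A and S stand in for the Alphabet and Symbol types.
 */
final class LsidRegistry<A, S> {
    private final Map<String, A> alphabetsByLsid = new HashMap<>();
    private final Map<String, S> symbolsByLsid = new HashMap<>();

    void registerAlphabet(String lsid, A alphabet) { put(alphabetsByLsid, lsid, alphabet); }

    void registerSymbol(String lsid, S symbol) { put(symbolsByLsid, lsid, symbol); }

    A alphabetForLsid(String lsid) { return get(alphabetsByLsid, lsid); }

    S symbolForLsid(String lsid) { return get(symbolsByLsid, lsid); }

    // An LSID must identify exactly one object, so a second registration is an error.
    private static <T> void put(Map<String, T> map, String lsid, T value) {
        Objects.requireNonNull(lsid, "lsid");
        Objects.requireNonNull(value, "value");
        if (map.putIfAbsent(lsid, value) != null) {
            throw new IllegalStateException("duplicate LSID: " + lsid);
        }
    }

    // Fail loudly on an unknown LSID instead of returning null.
    private static <T> T get(Map<String, T> map, String lsid) {
        T value = map.get(lsid);
        if (value == null) {
            throw new NoSuchElementException("unknown LSID: " + lsid);
        }
        return value;
    }
}
```

The registrations would presumably be driven from AlphabetManager.xml at
startup, with identifiers along the lines of
urn:lsid:open-bio.org:symbol:example-adenine (an invented example, not a
proposed naming scheme).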

The LSIDs can be stored in the Annotation objects which are already
attached to alphabets and symbols (but not used for very much).

Serialization/deserialization code can go on the standard implementations
of Symbol and Alphabet.  I think we can probably ditch the magic
implementations for the well-known case.  The serialization support
code throws an exception if it can't find an LSID in the appropriate
place in the symbol/alphabet.

This means it ought to be possible to have user-created alphabets
which serialize sensibly, without things getting messy -- just specify
LSIDs and it will serialize safely.  No LSIDs == error on serialization.
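
A minimal sketch of how the serialization side could behave, under some
assumptions of mine: SketchSymbol is a stand-in for the standard Symbol
implementation, the LSID is held in a plain field rather than in the
Annotation, and the static map plays the role of the LSID->symbol map in
AlphabetManager.  Using writeReplace with a small handle object is one
possible way to do it, not necessarily what we would end up with.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a Symbol implementation; not BioJava API. */
final class SketchSymbol implements Serializable {
    private static final long serialVersionUID = 1L;

    // Plays the role of the LSID->symbol map proposed for AlphabetManager.
    private static final Map<String, SketchSymbol> BY_LSID = new ConcurrentHashMap<>();

    private final String name;
    private final String lsid; // null means "no LSID assigned"

    SketchSymbol(String name, String lsid) {
        this.name = name;
        this.lsid = lsid;
    }

    /** Makes a symbol resolvable by its LSID; returns the canonical instance. */
    static SketchSymbol register(SketchSymbol symbol) {
        SketchSymbol existing = BY_LSID.putIfAbsent(symbol.lsid, symbol);
        return existing != null ? existing : symbol;
    }

    // Only the LSID goes onto the stream. No LSID == error on serialization.
    private Object writeReplace() throws ObjectStreamException {
        if (lsid == null) {
            throw new NotSerializableException("no LSID on symbol " + name);
        }
        return new Handle(lsid);
    }

    /** What is actually written to the stream in place of the symbol. */
    private static final class Handle implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String lsid;

        Handle(String lsid) {
            this.lsid = lsid;
        }

        // On the way back in, swap the handle for the registered instance.
        private Object readResolve() throws ObjectStreamException {
            SketchSymbol symbol = BY_LSID.get(lsid);
            if (symbol == null) {
                throw new InvalidObjectException("unknown LSID: " + lsid);
            }
            return symbol;
        }
    }

    /** Test helper: serialize to a byte array and read the object back. */
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return in.readObject();
        }
    }
}
```

With something like this, a registered symbol should come back from a round
trip as the very same instance, so == comparisons keep working, while a
symbol with no LSID fails at write time rather than deserializing into a
look-alike.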

Does this sound doable?  What's going to break if we try this?

     Thomas.

