[Biojava-dev] Suggestion for Canonical Symbols

Thomas Down td2@sanger.ac.uk
Mon, 9 Dec 2002 00:00:20 +0000


On Mon, Dec 09, 2002 at 11:59:01AM +1300, Schreiber, Mark wrote:
> Hi -
> 
> If you translate an RNA SymbolList into Protein the Symbols in the
> protein SymbolList come from the alphabet referenced by the
> ProteinTools.getTAlphabet.
> 
> The Symbols from the TAlphabet are not canonical with the Symbols from
> the other protein Alphabet. This has led to some very surprising bugs
> in some stuff we were developing. Given that Integer Symbols are now
> canonical even if they come from IntegerAlphabet or one of the
> Integer.SubAlphabets could the same happen for the protein Alphabets?

*sigh*

That was actually the original behaviour.  I broke it
(deliberately) a few weeks ago when fixing the knotty 
question of serializing ambiguous symbols, so now you know
who to blame.  At the time, requiring that all well-known
symbols should be scoped by Alphabet provided a sane way
of cleaning up the serialization code without having to write
totally new Symbol and Alphabet implementations for all the
well-known cases.  At least in the Protein/protein-term
case it probably does make sense to fix this.  I shall
ponder -- all suggestions welcome.
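
To make the distinction concrete, here is a minimal sketch of the two
behaviours.  It does not use the BioJava classes: Sym, Alphabet and
sameAlaInstance are hypothetical stand-ins, and the only assumption is
that symbols are compared by object identity.  With no shared pool each
alphabet creates its own instances (symbols scoped by Alphabet); with a
shared pool both alphabets hand out the same instance (canonical symbols).

```java
import java.util.HashMap;
import java.util.Map;

public class CanonicalSymbolSketch {
    /** Hypothetical stand-in for a symbol; compared by identity only. */
    static final class Sym {
        final String name;
        Sym(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for an alphabet holding named symbols. */
    static final class Alphabet {
        private final Map<String, Sym> symbols = new HashMap<>();

        // pool == null: every alphabet creates its own instances (scoped).
        // pool != null: instances are shared through the pool (canonical).
        Alphabet(Map<String, Sym> pool, String... names) {
            for (String n : names) {
                Sym s = (pool == null) ? new Sym(n) : pool.computeIfAbsent(n, Sym::new);
                symbols.put(n, s);
            }
        }

        Sym get(String name) { return symbols.get(name); }

        // Identity check: a symbol with the same name but a different
        // instance is NOT considered a member.
        boolean contains(Sym s) { return symbols.get(s.name) == s; }
    }

    /** Does "protein" accept the ALA symbol taken from "protein-term"? */
    static boolean sameAlaInstance(Map<String, Sym> pool) {
        Alphabet protein = new Alphabet(pool, "ALA", "GLY");
        Alphabet proteinTerm = new Alphabet(pool, "ALA", "GLY", "TER");
        return protein.contains(proteinTerm.get("ALA"));
    }

    public static void main(String[] args) {
        System.out.println("scoped:    " + sameAlaInstance(null));
        System.out.println("canonical: " + sameAlaInstance(new HashMap<>()));
    }
}
```

In the scoped case the membership test fails even though both alphabets
have a symbol named ALA, which is the kind of surprise Mark describes.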

The division between protein and protein-term is really
rather artificial.  As far as I can tell, the termination
symbol is a bit like the gap symbol, in that it never occurs
in "biologically real" sequences, but is a useful convenience
for computation.  Maybe we'll be able to build on that idea for
BJ2 and get rid of the annoying distinction.

     Thomas.