[Biojava-dev] Suggestion for Canonical Symbols

Thomas Down td2@sanger.ac.uk
Mon, 9 Dec 2002 01:02:34 +0000


On Mon, Dec 09, 2002 at 01:08:44PM +1300, Schreiber, Mark wrote:
> Would it be useful to use the Integer/ SubInteger model where protein
> alphabet is a sub alphabet of protein-term?

That's certainly a possibility.  The problem comes with sorting
out the SymbolTokenizations.  Currently, the Protein-TERM
tokenization has an "X" symbol which includes all possible
symbols _including TER_, which is different from the "X"
in PROTEIN.  Now, I don't know if this is important to anyone.
In fact, personally I can't help wondering if it's wrong.
On the other hand, the translation of the coding "nnn" does
include our hypothetical termination symbol.
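
To make the mismatch concrete, here is a rough sketch using plain
Java sets rather than the real BioJava classes.  Everything in it
(the AmbiguitySketch class, the expandX helper, the "TER" string,
and the assumption that PROTEIN holds exactly the 20 standard amino
acids) is made up for illustration.  It only shows that the same
"X" token expands to different symbol sets depending on which
alphabet's tokenization reads it.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: plain sets stand in for alphabets; none of this is BioJava API.
public class AmbiguitySketch {
    // Hypothetical name for the termination symbol.
    static final String TER = "TER";

    // Assumption: PROTEIN is modelled as the 20 standard amino acids.
    static Set<String> protein() {
        Set<String> symbols = new LinkedHashSet<>();
        for (char c : "ACDEFGHIKLMNPQRSTVWY".toCharArray()) {
            symbols.add(String.valueOf(c));
        }
        return symbols;
    }

    // PROTEIN-TERM is modelled as PROTEIN plus the termination symbol.
    static Set<String> proteinTerm() {
        Set<String> symbols = protein();
        symbols.add(TER);
        return symbols;
    }

    // In this sketch "X" means "any symbol of the alphabet being tokenized".
    static Set<String> expandX(Set<String> alphabet) {
        return new LinkedHashSet<>(alphabet);
    }

    public static void main(String[] args) {
        Set<String> xInProtein = expandX(protein());
        Set<String> xInProteinTerm = expandX(proteinTerm());

        System.out.println(xInProtein.size());                      // 20
        System.out.println(xInProteinTerm.size());                  // 21
        // PROTEIN is a subset of PROTEIN-TERM ...
        System.out.println(xInProteinTerm.containsAll(xInProtein)); // true
        // ... but the two meanings of "X" are not the same set.
        System.out.println(xInProtein.equals(xInProteinTerm));      // false
        System.out.println(xInProteinTerm.contains(TER));           // true
    }
}
```

So the sub-alphabet relationship itself holds; it is only the
meaning of "X" that changes with the tokenization.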

This doesn't prevent us going the sub-alphabet route, but
it does mean we need to be careful to make sure everything
ends up with the right tokenization.

     Thomas.