[Biojava-dev] Protein alphabet names

Matthew Pocock matthew_pocock@yahoo.co.uk
Tue, 22 Oct 2002 11:50:56 +0100 (BST)


 --- Thomas Down <td2@sanger.ac.uk> wrote:
> Hi...
> 
> I've been working to tidy up the alphabet bootstrapping code
> in AlphabetManager.java, initially with the aim of reducing
> startup overhead of constructing a big DOM tree, but it's
> turned into a bit more of a refactoring and rationalizing
> exercise.

This code is very old (archaeological?) and it's great
that you're going through it.

> 
> One thing I noted is that there's a rather significant inconsistency
> in how symbols are named.  For nucleic acids, the name is the
> actual chemical name of the base -- adenine, guanine, etc.  However,
> for proteins we use three-letter codes (ALA, GLN).  This dates
> back to the days when symbols just had `long' and `short' forms,
> and we decided that for proteins the most important representations
> were 3-letter and 1-letter codes.  However, we now have separate
> SymbolTokenization objects, which means that this is no longer
> so much of an issue.  What I propose for 1.3 is to:
> 
>   - Make the name field the actual name of the amino acid
>     (alanine, glutamine).
> 
>   - Add an additional tokenization (probably called "three-letter"
>     unless someone comes up with a better suggestion) for people
>     who actually want 3-letter codes.
> 
> I understand that this change might break a few programs -- this should
> be pretty easy to correct for, though.
> 
> Does anyone have any objections to this?
> 

I have no problems with this as long as apps that
could be using different tokenizations before/after
the change fail spectacularly and there is some
document telling you how to trivially fix the error.
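
To make "fail spectacularly" concrete, here is a rough sketch in plain
Java. Residue, Tokenization and TokenizationDemo are made-up names for
illustration only, not the BioJava classes, and the real API may look
quite different. The only point is that a token looked up through the
wrong tokenization throws at once instead of quietly returning something.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a symbol; NOT a BioJava class.
final class Residue {
    final String name; // full chemical name, e.g. "alanine"
    Residue(String name) { this.name = name; }
}

// Hypothetical stand-in for a tokenization: a two-way token <-> symbol map.
final class Tokenization {
    private final String label;
    private final Map<String, Residue> byToken = new HashMap<>();
    private final Map<Residue, String> bySymbol = new HashMap<>();

    Tokenization(String label) { this.label = label; }

    Tokenization add(String token, Residue residue) {
        byToken.put(token, residue);
        bySymbol.put(residue, token);
        return this;
    }

    // Throws rather than returning null, so a stale token fails loudly.
    Residue parse(String token) {
        Residue residue = byToken.get(token);
        if (residue == null) {
            throw new IllegalArgumentException(
                "No symbol for token '" + token + "' in tokenization '" + label + "'");
        }
        return residue;
    }

    String tokenize(Residue residue) {
        String token = bySymbol.get(residue);
        if (token == null) {
            throw new IllegalArgumentException(
                "Symbol '" + residue.name + "' not in tokenization '" + label + "'");
        }
        return token;
    }
}

final class TokenizationDemo {
    static final Residue ALA = new Residue("alanine");
    static final Residue GLN = new Residue("glutamine");

    // After the proposed change: names are chemical names...
    static final Tokenization NAME =
        new Tokenization("name").add("alanine", ALA).add("glutamine", GLN);
    // ...and 3-letter codes live in their own tokenization.
    static final Tokenization THREE_LETTER =
        new Tokenization("three-letter").add("ALA", ALA).add("GLN", GLN);

    public static void main(String[] args) {
        System.out.println(THREE_LETTER.tokenize(ALA));
        System.out.println(NAME.tokenize(GLN));
        NAME.parse("ALA"); // old-style code: throws IllegalArgumentException
    }
}
```

Under this sketch, code that still feeds "ALA" to the name lookup gets
an exception naming the tokenization it used, and the fix is to ask for
the three-letter tokenization instead.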

Matthew

