[Dynamite] Is this working now then?
Ian Holmes
ihh@fruitfly.org
Sun, 5 Mar 2000 18:47:09 -0800 (PST)
On Mon, 6 Mar 2000, Ewan Birney wrote:
> And we are done for sequences
>
> (except - where does this fecking alphabet shit go?)
the fecking Alphabet erse goes in the sequence module -- right?
Looking back at it... when I first wrote it I suggested that we needed ways
of adding symbols to an Alphabet -- but I no longer think this is
necessary.
Here is the revised Alphabet IDL.
NB the comments about complementary alphabets at the end.
interface Alphabet
{
  // every symbol in the alphabet corresponds to a non-negative integer 0,1,2...(size-1)
  // negative integers represent ambiguous symbols, e.g. "N" "Y" "R" for DNA, "X" for proteins
  // -1 is always the fully unknown symbol ("N" for DNA, "X" for proteins)
  // these ambiguities can be represented by a weighted sum over the real symbols in the alphabet
  //   e.g. "Y" = 0.5 * "C" + 0.5 * "T"
  //        "N" = 0.25 * ("A" + "C" + "G" + "T")
  // such weighted sums are represented by ProbabilityVectors
  //
  typedef sequence<float> ProbabilityVector;

  // here are the "essential" methods ;-)
  string name();                    // name of alphabet e.g. "DNA"
  string alphabet_string();         // e.g. "acgt"
  long size();                      // returns alphabet_string().length()
  boolean equal_to (in Alphabet a); // compares the two alphabet_strings
  boolean contains (in char c);     // NB we can't just have char2int(char) return -1 for unknown characters,
                                    // since -1 = "N". If anyone thinks -1 should NOT mean "N", then speak up...
  long char2int (in char c);        // maybe these methods should throw exceptions for unknown characters
  char int2char (in long i);        // ...just to affirm that -1 really does mean "N" & not "default return value"?
  long unknown_int();               // returns -1 (should this be an enum instead?)
  char unknown_char();              // = int2char(-1); returns 'n' for DNA, 'x' for protein

  // the following are methods that are required to make rigorous mathematical sense of ambiguous characters
  ProbabilityVector char2vec (in char c);
  ProbabilityVector int2vec (in long i);
  char vec2char (in ProbabilityVector w); // makes a best guess
  long vec2int (in ProbabilityVector w);  // makes a best guess
  ProbabilityVector unknown_vec();        // returns a vector containing (1/size) for each symbol

  // the following methods relate to complementary alphabets (i.e. DNA & RNA)
  // Could do this by having a ComplementaryAlphabet derived class -- but
  // then, is there any clear mechanism for testing if an Alphabet is a
  // ComplementaryAlphabet (i.e. does IDL have an equivalent of Java's
  // instanceof operator?)
  boolean has_complement();         // TRUE for DNA & RNA; better handled by
                                    // inheritance & an instanceof operator if poss.
  long complement_int (in long i);
  char complement_char (in char c); // this should preserve the case if possible, so that "A" maps to "T" and not "t"
  ProbabilityVector complement_vec (in ProbabilityVector vec);
};
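
To make the ambiguity and complement semantics in the comments above concrete, here is a rough Python sketch for DNA only. It is an illustration under assumptions, not an implementation of the IDL: the class name DnaAlphabet, the codes -2 and -3 for "y" and "r", raising ValueError for unknown characters, and the nearest-vector rule used for the "best guess" are all made up for this example. Only -1 = "n" and the weighted sums themselves come from the comments.

```python
from typing import Dict, List


class DnaAlphabet:
    """Hypothetical DNA-only sketch of the Alphabet interface."""

    SYMBOLS = "acgt"
    # Only -1 = "n" comes from the IDL comments; -2 and -3 are assumed codes.
    AMBIGUOUS: Dict[str, int] = {"n": -1, "y": -2, "r": -3}
    # Each ambiguous code as a weighted sum over a, c, g, t.
    WEIGHTS: Dict[int, List[float]] = {
        -1: [0.25, 0.25, 0.25, 0.25],
        -2: [0.0, 0.5, 0.0, 0.5],  # y = c or t
        -3: [0.5, 0.0, 0.5, 0.0],  # r = a or g
    }

    def size(self) -> int:
        return len(self.SYMBOLS)

    def contains(self, c: str) -> bool:
        c = c.lower()
        return len(c) == 1 and (c in self.SYMBOLS or c in self.AMBIGUOUS)

    def char2int(self, c: str) -> int:
        # Raise for unknown characters so that -1 can only ever mean "n".
        if not self.contains(c):
            raise ValueError(f"unknown character: {c!r}")
        c = c.lower()
        return self.AMBIGUOUS[c] if c in self.AMBIGUOUS else self.SYMBOLS.index(c)

    def int2char(self, i: int) -> str:
        if 0 <= i < self.size():
            return self.SYMBOLS[i]
        for c, code in self.AMBIGUOUS.items():
            if code == i:
                return c
        raise ValueError(f"unknown symbol code: {i}")

    def int2vec(self, i: int) -> List[float]:
        if 0 <= i < self.size():
            # A real symbol is a one-hot vector.
            return [1.0 if j == i else 0.0 for j in range(self.size())]
        if i in self.WEIGHTS:
            return list(self.WEIGHTS[i])
        raise ValueError(f"unknown symbol code: {i}")

    def char2vec(self, c: str) -> List[float]:
        return self.int2vec(self.char2int(c))

    def vec2int(self, w: List[float]) -> int:
        # Assumed "best guess": the known code whose vector is nearest to w
        # in squared Euclidean distance.
        codes = list(range(self.size())) + list(self.WEIGHTS)

        def dist(code: int) -> float:
            return sum((x - y) ** 2 for x, y in zip(w, self.int2vec(code)))

        return min(codes, key=dist)

    def vec2char(self, w: List[float]) -> str:
        return self.int2char(self.vec2int(w))

    def complement_vec(self, vec: List[float]) -> List[float]:
        # In "acgt" order a<->t and c<->g, so complementing reverses the vector.
        return list(reversed(vec))

    def complement_int(self, i: int) -> int:
        return self.vec2int(self.complement_vec(self.int2vec(i)))

    def complement_char(self, c: str) -> str:
        out = self.int2char(self.complement_int(self.char2int(c)))
        # Preserve case, so "A" maps to "T" and not "t".
        return out.upper() if c.isupper() else out
```

Routing complement_int and complement_char through complement_vec means ambiguous symbols are complemented for free: "y" (c or t) becomes "r" (g or a), and "n" stays "n".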