[Dynamite] Is this working now then?

Ian Holmes ihh@fruitfly.org
Sun, 5 Mar 2000 18:47:09 -0800 (PST)


On Mon, 6 Mar 2000, Ewan Birney wrote:

> And we are done for sequences
> 
> (except - where does this fecking alphabet shit go?)

the fecking Alphabet erse goes in the sequence module -- right?

looking back at it... when I first wrote it I suggested that we needed ways
of adding symbols to an Alphabet -- but I no longer think this is
necessary.

Here is the revised Alphabet IDL.
NB the comments about complementary alphabets at the end.

interface Alphabet
{
  // every symbol in the alphabet corresponds to a non-negative integer 0,1,2...(size-1)
  // negative integers represent ambiguous symbols, e.g. "N" "Y" "R" for DNA, "X" for proteins
  //  -1 is always the unknown symbol ("N" for DNA, "X" for proteins)
  // these ambiguities can be represented by a weighted sum over the real symbols in the alphabet
  //  e.g. "Y" = 0.5 * "C" + 0.5 * "T"
  //       "N" = 0.25 * ("A" + "C" + "G" + "T")
  // such weighted sums are represented by ProbabilityVectors
  //

  typedef sequence<float> ProbabilityVector;

  // here are the "essential" methods ;-)

  string name();              // name of alphabet e.g. "DNA"
  string alphabet_string();   // e.g. "acgt"

  long size();                        // returns alphabet_string().length()
  boolean equal_to (in Alphabet a);   // compares the two alphabet_strings

  boolean contains (in char c);   // NB we can't just have char2int(char) return -1 for unknown characters,
                                  // since -1 = "N". If anyone thinks -1 should NOT mean "N", then speak up...

  long char2int (in char c);  // maybe these methods should throw exceptions for unknown characters
  char int2char (in long i);  // ...just to affirm that -1 really does mean "N" & not "default return value"?

  long unknown_int();     // returns -1 (should this be an enum instead?)
  char unknown_char();    // = int2char(-1); returns 'n' for DNA, 'x' for protein

  // the following are methods that are required to make rigorous mathematical sense of ambiguous characters

  ProbabilityVector char2vec (in char c);
  ProbabilityVector int2vec (in long i);

  char vec2char (in ProbabilityVector w);   // makes a best guess
  long vec2int (in ProbabilityVector w);    // makes a best guess

  ProbabilityVector unknown_vec();    // returns a vector containing (1/size) for each symbol

  // the following methods relate to complementary alphabets (i.e. DNA & RNA)
  // Could do this by having a ComplementaryAlphabet derived class -- but
  // then, is there any clear mechanism for testing if an Alphabet is a
  // ComplementaryAlphabet (i.e. does IDL have an equivalent of Java's
  // instanceof operator?)

  boolean has_complement();   // TRUE for DNA & RNA; better handled by
                              // inheritance & an instanceof operator if poss.

  long complement_int (in long i);
  char complement_char (in char c);       // this should preserve the case if possible, so that "A" maps to "T" and not "t"
  ProbabilityVector complement_vec (in ProbabilityVector vec);
};
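
To make the weighted-sum idea in the comments above concrete, here is a rough Python sketch of how char2vec, vec2char and complement_vec might behave for DNA. It is only an illustration under assumptions: the tables AMBIGUOUS and COMPLEMENT, the uniform weighting of ambiguous symbols, and the "nearest vector" rule used for the best guess are all hypothetical choices, not part of the IDL or of any existing Dynamite code.

```python
# Hypothetical sketch: function names mirror the IDL, everything else is assumed.
DNA = "acgt"  # alphabet_string(); a symbol's position is its integer code

# A few IUPAC ambiguity codes, written as the real symbols each one covers.
AMBIGUOUS = {"n": "acgt", "y": "ct", "r": "ag"}

COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def char2vec(c):
    """Return a uniform weight vector over the real symbols c may stand for."""
    if len(c) != 1:
        raise ValueError("expected a single character: %r" % (c,))
    members = AMBIGUOUS.get(c.lower(), c.lower())
    if any(m not in DNA for m in members):
        raise ValueError("unknown character: %r" % (c,))
    return [1.0 / len(members) if s in members else 0.0 for s in DNA]


def vec2char(vec):
    """Best guess (assumed rule): the character whose vector is nearest to vec."""
    candidates = list(DNA) + list(AMBIGUOUS)

    def distance(ch):
        return sum((x - y) ** 2 for x, y in zip(vec, char2vec(ch)))

    return min(candidates, key=distance)


def complement_vec(vec):
    """Move the weight of each symbol onto its complementary symbol."""
    out = [0.0] * len(DNA)
    for i, s in enumerate(DNA):
        out[DNA.index(COMPLEMENT[s])] += vec[i]
    return out
```

With these definitions char2vec("Y") gives 0.5 for "c" and 0.5 for "t", matching the "Y" = 0.5 * "C" + 0.5 * "T" comment, and complementing that vector lands on the vector for "R". Note that char2vec raises for unknown characters rather than returning the "N" vector, which is one way of settling the question raised next to char2int.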