[BioPython] codons and complements

Sun, 10 Oct 1999 15:47:40 -0600

I've been thinking about how to do translations and complements of
sequences.  I've got an idea about how to do them "nicely" that I
want to propose.

I'll start with translation.  Given a nucleotide sequence and
a codon table, return a protein sequence.  The translation will
always be 3 nucleotides to 1 amino acid.  The DNA/RNA and the
protein will have their own alphabets.

There are several things to worry about:
  o  is it DNA or RNA (slightly different alphabet)?

  o  which translation codes do you want to use?  Eg,
ftp://ncbi.nlm.nih.gov/entrez/misc/data/gc.prt lists 15 different
encodings.

  o  what do you do with stop codons?  Produce multiple proteins
or just stop translating?  Or raise an error?

  o  what do you do if the input size is not a factor of 3?  Ignore
the last 1 or 2, or raise an error?

  o  will the returned protein always be the same concrete type?
For example, we've mentioned the different types of alphabets
available and I proposed having a more general "seq_types" attribute
listing the interfaces supported by the sequence.  If that proposal
is to be useful, then there must be a way to set the seq_types for
the newly created protein sequence.

I would like to embed most of this information in the translation
table.  Something like

class UniversalDNA_TranslationTable:
  codon_table = {
    "TTG": "F", "TTC": "F", ... , "GGG": "G",
  }
  output_seq_constructor = IUPAC_UnambiguousProtein

class UniversalRNA_TranslationTable:
  codon_table = {
    "UUG": "F", "UUC": "F", ..., "GGG": "G",
  }
  output_seq_constructor = IUPAC_UnambiguousProtein

So an implementation of "translate" might look like:

def translate(seq, trans_table = UniversalRNA_TranslationTable):
  s = seq.seq  # get the underlying string representation
  lookup = trans_table.codon_table

  # If the size is not divisible by 3, ignore the last nucleotides
  letters = []
  for i in range(0, len(s)-3, 3):
    letters.append(lookup[s[i:i+3]])

  new_seq = string.join(letters, "")
  return trans_table.output_seq_constructor(new_seq))

This "solves" problem 2 by requesting an explicit translation table, 4
by ignoring extra characters, and 5 by passing in the appropriate
constructor.

I don't know the right solution for 3 (stop codons).  Any solution
is easy enough to implement, though I prefer raising an exception.

I would like to expand things a bit.  It is possible, when you've
created the sequence, to tell the translation table neede for it.
For example, GenBank stores the translation table (if not the
(not so) universal code) as one of the fields.  Thus, it would be
nice to have the translation table stored along with the sequence,
and have the translation routine know about it.

For example, suppose you define the sequence as:

class DNASeq:
  def __init__(self, seq, translation_table=UniversalDNA_TranslationTable):
     self.seq = seq
     self.translation_table = translation_table

then the implementation of translation could be:

def translate(seq, trans_table = None):
  # Note: the 3 term getattr is, I believe, in 1.5.2.  Otherwise, use
  # try/except AttributeError.
  if trans_table is None:
    trans_table = getattr(seq, "translation_table", \
                          UniversalRNA_TranslationTable)

  # The rest of this implementation is the same as previous

Thus, if the input sequence knows its translation table; that data
is used, unless overridden by the caller.

This is a solution to problem 1, the "which alphabet is this" problem.

On the implementation side, the sequences would probably store
the trans_table information as a class variable, and have:

class Seq:
  def __init__(seq):
    self.seq = seq

class UniverasalDNASeq(Seq):
  translation_table = UniversalDNA_TranslationTable

This is useful since then you don't have this extra reference in
every created sequence (it's stored in the __class__ and not the
instance's __dict__).

I have a slight concern about cycles.  The nucleotide sequences have
a reference to the translation table, and the translation table has
a reference to the protein sequence constructor.  Thus, there are no
cycles in the objects, but there may problems with imports, if the
standard DNA and protein sequence definitions are done in the same
module (say, Bio.Seq), then it will import the translation tables
module (say, Bio.TransTables), which will import Bio.Seq to get the
proteins.

There are several solutions to this problem.  1) I think Python
allows this, so long as the import is not actually used to get
data from a partially loaded module, 2) Split the DNA and protein
sequences into their own modules, 3) hide everything through string
labels, set up a registry of strings to classes, and put the
import inside the registry.

I hope the first of these works, but I've not tested it.

What do you all think of this solution?

Oh, and there have also been the question, why don't we make the
translation functionality available as a method of the sequence
object, with an optional parameter containing the translation table
to use?  (Rather than this proposal, which makes translation be
done via an external function.)

My answer is that I like light-weight objects, that is, objects
with little functionality.  They are easier to understand, and
allow for more, and easier, implementations.  For example, suppose
you wanted to wrap a CORBA/LSR sequence object for use as a biopython
sequence object.  If you implement "translate" as a method, then
you'll have to get the sequence as a string and do the translation.
That means you reimplement code since the standard biopython sequence
already has that code and the only thing different is the way to get
the sequence as a string.
  So what will happen, to prevent duplication, is a function will be
created which takes a string and a translation table and returns the translated
string.  But this is almost exactly what my implementation
does anyway.  Thus, making the functionality be a method doesn't
really gain much, and just adds (IMHO) to excess code.

The other half of my subject line is how to make a reverse complement
of a given string.  The algorithm is easy, given an input string and
a dictionary mapping a character to its inverse (so for DNA, "A" to
"T", and for RNA "A" to "U", and for the ambiguous codes, "W" to "S"),
traverse the input string backwards, building up the new string from
the mapping of each character.

Again, there is a problem of getting the alphabet mappings, and of
getting the correct object to use for the return type.

I would like to implement this in the same fashion, that is, each
sequence object (or its class) contains a "complement" attribute
which has the character mappings, and has the class to use for
getting the constructor.

class UnambiguousDNAComplement:
  table = {"A": "T", "T": "A", "C": "G", "G": "C" }
  constructor = IUPAC_UnambiguousDNA

class IUPAC_UnambiguousDNA(seq):
  complement = DNAComplement

so the rc code looks like

def reverse_complement(seq, complement = None):
  if complement is None:
     complement = getattr(seq, "complement", UnambiguousRNAComplement)

  s = list(seq.seq)
  s.reverse()
  new_s = string.join(map(complement.table.get, s), "")

  return complement.constructor(new_s)

Actually, however, I think there is a cyclic problem here since
the UnambiguousDNAComplement refers to IUPAC_UnambiguousDNA and
vice versa.  If this is serious (and I don't think it is since
it means the classes won't get destroyed, and I don't think they
do anyway, and there are few of them) then the last bit of code
can be replaced with

  klass = complement.constructor
  if klass is None:
    klass = seq.__class__
  return klass(new_s)

This will force the class to be the same as the input sequence type,
but assumes the input class takes a string as a constructor.

I want things to work this way since not all sequences take a string
as an input type.  For example, a OMG Sequence cannot, but you would
like to get a complement for it anyway.  All that wrapper code needs
to do, then, is set the "complement" attribute to the right type

class CORBA_UnambiguousDNAComplement(UnambiguousDNAComplement):
  constructor = IUPAC_UnambiguousDNA # was "None" to prevent cycles

class CORBA_DNASeq:
  complement = CORBA_UnambiguousDNAComplement

or if determined dynamically, something like

class CORBA_Seq:
  def __getattr__(self, key):
    if key == "complement":
       if self.type == "dna":
         return CORBA_UnambiguousDNAComplement
       elif self.type == "rna":
         return CORBA_UnambiguousRNAComplement
     raise AttributeError, key

Again, thoughts? discussion?  I personally think the idea is very
cool, useful and general.  I like it :)

						Andrew
						dalke@acm.org