[BioPython] codons and complements

Andrew Dalke dalke@bioreason.com
Tue, 12 Oct 1999 02:38:19 -0600


Thomas.Sicheritz@molbio.uu.se said:
> Why not hardcoding the Universal alphabet plus some frequent
> variants, like standard mitochondria and chloroplasts. For other
> alphabets, I think the user/programmer knows best what to choose
> and when -  so she/he could pass an optional parameter (dictionary,
> list, flag) containing the information about the differing codons.

Isn't that what I said, though in a more roundabout way?  :)
My proposal was that non-standard translation data be accessible in two
different ways:

1) if the "translate" function is called with a non-None 2nd parameter,
use that as the translation code.  This is the optional parameter part
of your statement.

2) if the 2nd parameter is None, and the sequence contains the
attribute "codon_table", use that for the translation information.
This is what I consider to be the useful/novel part of my proposal.

3) if there is no "codon_table" attribute, use the universal
translation code.  This is the default case you want.

Consider when the translate function is called.  If the only way
to get an alternate codon table is to pass it in as a parameter,
then there must be some data structure used in the calling function
which stores both the sequence and the codon table.

What I want to do is allow for that information to be optionally
stored as part of the sequence itself, and thus make life (I hope)
a little easier for people using the translate function.

As a case scenerio, consider getting sequence, pulling off a
subsequence, getting the reverse complement, and translating to
protein:

  seq_record = get_sequence_from_database("SomethingX")
  seq = seq_record.seq
  rc_seq = revcomp(seq[100:400])
  protein = translate(rc_seq)
  print str(protein)

Now imagine that the code needs to support different translations,
like bacterial.  If the codon table information is stored as part
of the sequence (as I am proposing), then the above example does
not need to change because it is available from the sequence
stored in the seq_record.

However, if the information is only available from the sequence
record, then the code would need to be something like:

  seq_record = get_sequence_from_database("SomethingX")
  seq = seq_record.seq
  rc_seq = revcomp(seq[100:400])
  protein = translate(rc_seq, seq_record.codon_table)
  print str(protein)

It's a little more complicated since it needs to pass the codon
table around.  That little bit may not seem like much, but it means
that most programs won't have to worry about special cases with
different translation tables.  This makes programs easier to read and
write.

In addition, to be correct for different alphabets, the above
code is still wrong, since it needs to know if this is the revcomp
of DNA or RNA.  This information will have to be in the sequence
record, so the code would likely be a change of

  rc_seq = revcomp(seq[100:400])

into

  rc_seq = revcomp(seq[100:400], seq_record.complement_table)

or even

  if seq_record.seq_type == "dna":
    rc_seq = revcomp(seq[100:400], DNA_complement_table)
  elif seq_record.seq_type == "rna":
    rc_seq = revcomp(seq[100:400], RNA_complement_table)
  else:
    assert "Unknown sequence type: %s" % `seq_record.seq_type`

But my first example code, when using my proposal, still works
even with this complication!

> (e.g. Under some special conditions, UGA can code for selenocysteine
> instead for Stop, but only the user knows when :-)

See, here's where my background as a physicist and not biologist
pops up.  In this case, will UGA *sometimes" code for selenocysteine
and at other times for a stop codon?

If so, I can envision two solutions:
  1) don't use the standard translate function, since the algorithm
doesn't support this case
  2) pass in only the sequence up to the stop codon, and use a
translation table which maps "UGA" to "X"  (this is my approach)

If not, that is, if UGA always translates to SEC, then pass in the
right translation table and, ummm, tell me how translation stops
in real life; does it use UAA or UAG instead?


> A translation call should ONLY translate and not try to guess about
> biological meanings. A stop codon should be translated into a '*' - at
> least I am expecting that. But I also know situations where I'd like to
> raise an exception if the sequence shows up to contain stop codons.
> Another flag ? translate(seq,'-forceCDS') ?.

But that gets us back to the problem I'm trying to solve, which is,
how can we be clearer about which alphabets we're using?  I would
like to have something which starts with an RNA sequence defined
using the IUPAC alphabet and create a protein sequence also defined
with the IUPAC alphabet.  "*" is not part of the IUPAC definition
for a protein, since * isn't an amino acid.

In fact, once you start allowing "*" in a sequence, there are
some problems I can see, like, what is the length of "PYTH*N"? 5?
Or 6?

I do realize that people want to do exactly what you're saying, and
I want to point out that the solution I'm proposing supports what
you want.

For example, suppose the translation table doesn't list "UGA" and you
want it to use "*" for all elements it doesn't support.  Then you
can do a translation like:

class MarkMissingCodon:
  def __init__(self, codon_table):
    self.data = codon_table
  def __getitem__(self, key):
    return self.data.get(key, "*")
      
class MarkCodonTable:
  def __init__(self, codon_table):
    self.codon_table = MarkMissingCodon(codon_table)
    self.output_seq_constructor = ProteinSeq

translate(seq, MarkCodonTable(seq.codon_table))

and that will convert the restrictive definition I gave into the
more permissive one.  In this case, if the lookup fails, a "*"
is returned in its place.  As another solution, it is also possible
to have a list of non-restrictive codon tables just like I want a
list of permissive ones.

I do want to point out that in general I like being restrictive
about things -- in software that is.  I've found that not doing so
leads to problems for the future.  In addition, less restrictive
solutions can be built from more restrictive ones, but the other
way around is usually more difficult to do.

Also, I discourage people from having flags in function calls which
trigger different types of behaviour.  I would much rather have
two different functions (well, okay, I would much rather have one
function).  The problem is, the "if" statements start to get too
complicated and hard to test.  I've seen a lot of functions where
there were 4 or more different flags passed as input and some of
the "if" statements were never tested -- easy to tell since the
actual code of those cases didn't even work!

Compare that to the wrapper solution I gave above, which doesn't
have any if statements and produces the desired behaviour.  (To
be more precise, what I did was push the if statement out of the
function and into its caller.)

> >   o  what do you do if the input size is not a factor of 3?  Ignore
> > the last 1 or 2, or raise an error?
> IMHP ignore the rest unless -forceCDS given.

Hmm, according to the rules I listed above, I would always raise
an exception and force the caller to pass in a sequence of length % 3.

A lenient wrapper to it could be implemented as

def permissive_translate(s, codon_table = None):
  translate(s[:len(s)-len(s)%3], codon_table)

But this is a case where I'm happy to ignore the last 1 or 2
untranslated codons.

> Next question should we use 'clever' translation ? (or greedy
> translation?) Synonymous codons, like CTA, CTT, CTG and gives
> Leucine(L). Should CTN (maybe from a sequencing error) recognize
> the codon family and translate CIN to L ?

I was thinking that, yes, this should be allowed.  In my proposed
scheme, there would be a translation table for IUPAC ambiguous RNAs
which mapped
  CTA -> L
  CTT -> L
  CTC -> L
  CTG -> L
and also all
  CTN -> L
  CTR -> L
   ...

but for cases like
  CAT -> H
  CAC -> H
  CAG -> Q
  CAA -> Q
would have
  CAY -> H
  CAR -> Q
but leave "CAN", etc. to be undefined.

Hmmm, this also suggests that "UGA" be defined as None to indicate
that it is a valid codon, but is flagged to be the Stop codon, while
not being defined means that it cannot be translated.  Alas, this
level of distinction is identical to Perl's "defined" vs. "exists",
which can lead to some problems.  (Ahh, I don't think there will
be problems.)

The ambiguous RNA would also have tables listing things like
"Y" <-> "R" for doing reverse complements.

Also, another question.  What's a "greedy translation"?  Is that
another name for a "clever translation"?

To summarize, I'm trying to 1) support generic algorithms like
"reverse complement" and "translate" in such a way that, if
extra information about a sequence is known, that data can be
used automatically, and 2) if different beaviours are needed (which
they are) then it is easy to adapt the code for the different
needs.

So I believe everything you (Thomas) want can easily be built
from the mechanisms I've proposed.

						Andrew Dalke
						dalke@acm.org