[BioPython] codons and complements

Andrew Dalke dalke@bioreason.com
Tue, 12 Oct 1999 11:41:50 -0600


Iddo Friedberg <idoerg@cc.huji.ac.il> said:
> However, maybe we should make allowances for RNA alphabet. Why?
> because scientists working on RNA would like, at the end of the day,
> to present their sequences using the RNA alphabet.

That's my reason.  I've nearly convinced someone to start with
biopython and contribute some of his packages.  He does something
(don't ask, I'm not sure :) with using RNA secondary structure
prediction as a factor in determining phylogenies.

He definitely would like to deal with RNA sequences as a first
class object, since then he won't have to modify his tools to
deal with our data structure.

> Simple solution: discard the idea of an RNA alphabet altogehter.
> 99% of the cases it is not needed.

If we can deal with the 1% of cases where it is needed, but not
affect the complexity or generality of the rest of the code, then
I would like to handle it appropriately.

> Simple solution 2: Add a asRNA() method. This'll cause each T to
> be replaced by U for representation purposes. Sequence will still
> be kept in the DNA alphabet.

This makes the sequence object be modal, which is usually a bad
thing, especially once you start getting into multithreaded
programming.

A cleaner way, which would do almost exactly what you want, is
for "asRNA" to return an wrapper class which shares the same
underlying DNA sequence, but converts to RNA alphabet as needed.

> Brute force solution:
> Or maybe we should just have two nucleotide subclasses: RNA and DNA.

This is the solution I prefer, along with a function called something
like "transcribe" which turns DNA into RNA.  This is comparable to
your "asRNA" method.

Is there ever a case where people want to use a different alphabet
for DNA/RNA, such as the case brought up a while ago where there
is case sensitive meaning to the letters?

If so, then transcription has similar alphabet conversion problems
as translation.  But I think this is less common.

> Yes, that's right. And whether a given UGA will be read as a coding or
> STOP is, as Thomas pointed out, a user decision. (In reality, this
> depends on the mRNAs secondary structure, which might be altered by a
> mutation).
> [...]
> Also, there are other mistranslations of specific codons, which are
> environment-dependent. The user/caller should handle that.

Then this means there is no (simple) generic function function which
will do the correct translation?  If so, I'm willing to defer the
problem to a "selenocysteine_translate" or "environment_translate"
function which deals with these cases, and let the translate only
be concerned about 1 to 1 (or 3 to 1 :) mappings between nucleotides
and proteins.

Ummm, but then this lead us naturally to wanting "translate" be
a method of the object, since that knows the environment the best.
And I am still convinced that it shouldn't be a method, since that
makes the object more complicated than I think it needs to be.


> Once you have a sequence with a '*' in it, it really stops having
> any biological meaning. So it shouldn't be made into a PEPSeq object
> anyhow. (Which, as I opined last week, should be the output of the
> translate method: not a string, but a PEPSeq instantiation).

BTW, I now strongly agree with your opinion.

For the question of the "*", I'm willing to allow "*" as a valid
letter, since it is needed for real world applications, like a GUI
showing that a stop codon exists.  In that case, though, the output
is not an "IUPAC peptide sequence" but something else which fits
most of the definitions of a peptide.  The machinery supports it.


> The buck should stop, IMHO, when creating sequence objects. A
> non-peptide sequence should not be able to graduate into a sequence
> object, unless the *-->X scheme is forced. (This is getting a bit
> wild...)

Yes.  The created object should not have a "seq_type" of protein,
since it isn't a protein.  Since the interface is protein-like,
it is still possible for most things (like a GUI) to use it, but
the algorithms which require a protein are allowed to raise an
TypeError exception.

> In this case, the default use should be the lenient one: ignore
> the last two, unless for some reason, that should be explicitly
> checked. Why? Because, unlike the previous case, here we can return
> a viable PEPSeq object despite the "error". And in may cases, the
> user will not want to check for that error, or the caller can check
> it easily enough.

I agree with you.  I'm still wondering what makes it special enough
to go against my previous statements about wanting explicit error
checking.  Probably because I can't see any calling function really
care about the exception being thrown, especially when it is easy
enough to check with len(seq)%3!=0.

						Andrew Dalke
						dalke@acm.org