[BioPython] codons and complements

Iddo Friedberg idoerg@cc.huji.ac.il
Tue, 12 Oct 1999 14:30:51 +0200 (GMT+0200)


On Tue, 12 Oct 1999, Andrew Dalke wrote:

: Thomas.Sicheritz@molbio.uu.se said:
: > Why not hardcoding the Universal alphabet plus some frequent
: > variants, like standard mitochondria and chloroplasts. For other
: > alphabets, I think the user/programmer knows best what to choose
: > and when -so she/he could pass an optional parameter (dictionary,
: > list, flag) containing the information about the differing codons.
: 
: Isn't that what I said, though in a more roundabout way?:)
: My proposal was that non-standard translation data be accessible intwo
: different ways:
: 
: 1) if the "translate" function is called with a non-None 2nd parameter,
: use that as the translation code.This is the optional parameter part
: of your statement.
: 
: 2) if the 2nd parameter is None, and the sequence contains the
: attribute "codon_table", use that for the translation information.
: This is what I consider to be the useful/novel part of my proposal.
: 
: 3) if there is no "codon_table" attribute, use the universal
: translation code.This is the default case you want.
: 

: 
: What I want to do is allow for that information to be optionally
: stored as part of the sequence itself, and thus make life (I hope)
: a little easier for people using the translate function.
: 
: As a case scenerio, consider getting sequence, pulling off a
: subsequence, getting the reverse complement, and translating to
: protein:
: 
: seq_record = get_sequence_from_database("SomethingX")
: seq = seq_record.seq
: rc_seq = revcomp(seq[100:400])
: protein = translate(rc_seq)
: print str(protein)
: 


: Now imagine that the code needs to support different translations,
: like bacterial.If the codon table information is stored as part
: of the sequence (as I am proposing), then the above example does
: not need to change because it is available from the sequence
: stored in the seq_record.
: 
: However, if the information is only available from the sequence
: record, then the code would need to be something like:
: 
: seq_record = get_sequence_from_database("SomethingX")
: seq = seq_record.seq
: rc_seq = revcomp(seq[100:400])
: protein = translate(rc_seq, seq_record.codon_table)
: print str(protein)
: 
: It's a little more complicated since it needs to pass the codon
: table around.That little bit may not seem like much, but it means
: that most programs won'thave to worry about special cases with
: different translation tables.This makes programs easier to read and
: write.
: 
: In addition, to be correct for different alphabets, the above
: code is still wrong, since it needs to know if this is the revcomp
: ofDNA or RNA.This information will have to be in the sequence
: record, so the code would likely be a change of
: 

Generally, RNA alphabet is not used. Non of the
databases I know of hold the RNA alphabet, even for RNA sequences. 
sequences, even for functional RNA (rRNA, ribozymes, hnRNA, tRNA, and bits
of mRNA) are represented in the databases by the DNA alphabet.

However, maybe we should make allowances for RNA alphabet. Why?
because scientists working on RNA would like, at the end of the day, to
present their sequences using the RNA alphabet. This, however, should be
as user-transparent as possible.
Simple solution: discard the idea of an RNA alphabet altogehter. 99% of the
cases it is not needed.

Simple solution 2: Add a asRNA() method. This'll cause each T to be replaced
by U for representation purposes. Sequence will still be kept in the DNA
alphabet.

Ambitious solution:
-------------------
Maybe just add 2 methods, which will give __repr__ the sequence in DNA or
RNA alphabet. The user may choose one explicitly. If the user doesn't, the
sequence will be __repr__'d according to a ("hidden") switch which
designates it as RNA or DNA. Switch will be set automatically at the
constructor, according to whether there's a U or a T, and may be
overridden.

>>> MyRNA = DNASeq("AGUCCUGUAAA")

>>> print MyRNA
AGUCCUGUAAA
>>> print MyRNA.asDNA()
AGTCCTGTAAA

>>> MyFunnyDNA = DNASeq("ACGTTGTGACCCUGT",seqtype=DNA)  #This DNA has a U
>>> print MyFunnyDNA
ACGTTGTGACCCTGT

>>> print MyFunnyDNA.asRNA()
ACGUUGUGACCCUGU

However, this scheme is not a perfect one: the information on where the U
is was lost, and that might be important. (A C-->U transition mutation 
in the DNA perhaps? This does happen).

I do not think that such cases should be taken care of at the level of the
sequence. Rather, it's an annotation job.

In any case, the sequence itself shall be kept as DNA, for all intents and
purposes, which saves us the translation problem, and a plethora of others
when dealing with 2 nucleotide alphabets.

__repr__ will look at the switch before returning the
string, to know whether it should be RNA or DNA alphabet.

Brute force solution:
Or maybe we should just have two nucleotide subclasses: RNA and DNA.


: rc_seq = revcomp(seq[100:400])
: 
: into
: 
: rc_seq = revcomp(seq[100:400], seq_record.complement_table)
: 
: or even
: 
: if seq_record.seq_type == "dna":
:   rc_seq = revcomp(seq[100:400], DNA_complement_table)
: elif seq_record.seq_type == "rna":
:   rc_seq = revcomp(seq[100:400], RNA_complement_table)
: else:
:   assert "Unknown sequence type: %s" % `seq_record.seq_type`
: 
: But my first example code, when using my proposal, still works
: even with this complication!
: 
: > (e.g. Under some special conditions, UGA can code for selenocysteine
: > instead for Stop, but only the user knows when :-)
: 
: See, here's where my background as a physicist and not biologist
: pops up.In this case, will UGA *sometimes" code for selenocysteine
: and at other times for a stop codon?
: 

Yes, that's right. And whether a given UGA will be read as a coding or
STOP is, as Thomas pointed out, a user decision. (In reality, this
depends on the mRNAs secondary structure, which might be altered by a
mutation).


: If so, I can envision two solutions:
: 1) don't use the standard translate function, since the algorithm
: doesn't support this case
: 2) pass in only the sequence up to the stop codon, and use a
: translation table which maps "UGA" to "X"(this is my approach)
: 
: If not, that is, if UGA always translates to SEC, then pass in the
: right translation table and, ummm, tell me how translation stops
: in real life; does it use UAA or UAG instead?
: 

As said before, UGA is still a stop codon, the environment determines it's
interpretation, in this particular case.

Also, there are other mistranslations of specific codons, which are
environment-dependent. The user/caller should handle that.


: > A translation call should ONLY translate and not try to guess about
: > biological meanings. A stop codon should be translated into a '*' - at
: > least I am expecting that. But Ialso know situations where I'd like to
: > raise an exception if the sequence shows up to contain stop codons.
: > Another flag ? translate(seq,'-forceCDS') ?.
: 
: But that gets us back to the problem I'm trying to solve, which is,
: how can we be clearer about which alphabets we're using?I would
: like to have something which starts with an RNA sequence defined
: using the IUPAC alphabet and create a protein sequence also defined
: with the IUPAC alphabet."*" is not part of the IUPAC definition
: for a protein, since * isn't an amino acid.
: 
: In fact, once you start allowing "*" in a sequence, there are
: some problems I can see, like, what is the length of "PYTH*N"? 5?
: Or 6?
: 


Once you have a sequence with a '*' in it, it really stops having any 
biological meaning. So it shouldn't be made into a PEPSeq object anyhow.
(Which, as I opined last week, should be the output of the translate
method: not a string, but a PEPSeq instantiation).

: I do realize that people want to do exactly what you're saying, and
: I want to point out that the solution I'm proposing supports what
: you want.
: 
: For example, suppose the translation table doesn't list "UGA" and you
: want it to use "*" for all elements it doesn't support.Then you
: can do a translation like:
: 

[snip...]



: 
: Also, I discourage people from having flags in function calls which
: trigger different types of behaviour.I would much rather have
: two different functions (well, okay, I would much rather have one
: function).The problem is, the "if" statements start to get too
: complicated and hard to test.I've seen a lot of functions where
: there were 4 or more different flags passed as input and some of
: the "if" statements were never tested -- easy to tell since the
: actual code of those cases didn't even work!
: 

I agree. But we should have both options, the restrictive an the
permissive one. The restrictive one should be the one the user
defaults to. That is, when naming them we should have ``translate()''
(which raises an exception) and the other one, which obviously, cannot
return 'PYTH*N' as a PEPSeq instantiation, but only as a string. Unless we
translate the stop codon into an ``X'', if the user wishes to force this.

The buck should stop, IMHO, when creating sequence objects. A non-peptide
sequence should not be able to graduate into a sequence object, unless the
*-->X scheme is forced. (This is getting a bit wild...)


: Compare that to the wrapper solution I gave above, which doesn't
: have any if statements and produces the desired behaviour.(To
: be more precise, what I did was push the if statement out of the
: function and into its caller.)
: 



: > > o  what do you do if the input size is not a factor of 3?  Ignore
: > > the last 1 or 2, or raise an error?
: > IMHP ignore the rest unless -forceCDS given.
: 
: Hmm, according to the rules I listed above, I would always raise
: an exception and force the caller to pass in a sequence of length % 3.
: 
: A lenient wrapper to it could beimplemented as
: 
: def permissive_translate(s, codon_table = None):
: translate(s[:len(s)-len(s)%3], codon_table)
: 
: But this is a case where I'm happy to ignore the last 1 or 2
: untranslated codons.

In this case, the default use should be the lenient one: ignore the last
two, unless for some reason, that should be explicitly checked. Why?
Because, unlike the previous case, here we can return a viable PEPSeq
object despite the "error". And in may cases, the user will not want to
check for that error, or the caller can check it easily enough.



Got to think over the rest...

Iddo

--

/* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2;
/*  |  */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2*
/*  |  */x*y+b,x=t)t=x*x-y*y+a;}
/* --- ddo Friedberg */