[BioPython] Re: back-translation method for Seq object?

Wed Oct 22 08:31:12 UTC 2008

On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:

> For completeness as these are not 100% correct,
> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

There are some difficulties with this encoding (IUPAC codes are at
http://www.chick.manchester.ac.uk/SiteSeer/IUPAC_codes.html)

YTN -> [CT]T[ACGT] -> {CTA, CTC, CTG, CTT, TTA, TTC, TTG, TTT}, two of which
(TTC and TTT) encode phenylalanine, not leucine.

MGV -> [AC]G[ACG] -> {AGA, AGC, AGG, CGA, CGC, CGG}, of which AGC does not
encode arginine (it encodes serine); the set also omits CGT, which does
encode arginine.

WSN -> [AT][CG][ACGT] -> {ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, TCA, TCC,
TCG, TCT, TGA, TGC, TGG, TGT}, of which 10 codons do not encode serine.

This would cause problems if we wanted to translate our back-translation
back to the original protein sequence (however we might want to do this).
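
The three expansions above can be checked mechanically. The following is a minimal sketch in plain Python (no Biopython required), assuming the standard genetic code throughout; the names expand() and check() are made up for illustration and are not part of any existing API.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes -> the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand(ambiguous_codon):
    """Return the set of unambiguous codons matched by an IUPAC codon."""
    return {"".join(c) for c in product(*(IUPAC[b] for b in ambiguous_codon))}


def check(ambiguous_codon, amino_acid):
    """Return (wrong, missed) for an IUPAC codon and a one-letter amino acid.

    wrong:  codons matched by the IUPAC codon that encode something else.
    missed: codons for the amino acid that the IUPAC codon does not match.
    """
    matched = expand(ambiguous_codon)
    true_codons = {c for c, aa in CODON_TABLE.items() if aa == amino_acid}
    return matched - true_codons, true_codons - matched


for code, aa in [("YTN", "L"), ("MGV", "R"), ("WSN", "S")]:
    wrong, missed = check(code, aa)
    print(code, aa, "wrong:", sorted(wrong), "missed:", sorted(missed))
```

For YTN this reports TTC and TTT as wrong, for MGV it reports AGC as wrong and CGT as missed, and for WSN it reports ten wrong codons.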

> Ser is really so bad that one would suggest providing a strong warning
> and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively.

We could just backtranslate all amino acids to NNN and avoid the problem
entirely ;)

>> If we want to provide a simple string or Seq object, we can either
>> pick an arbitrary codon in each case (as in the first attachment on
>> Bug 2618), or perhaps represent some of the possible codons using
>> ambiguous nucleotides.
>> 
>> e.g.
>> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous
>> nucleotides
>> 
>> or,
>> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
>> nucleotides
>> 
>> Note in either example, the following nice property holds:
>> translate(back_translate("MR")) == "MR"

This would be an important consideration for a back_translate() method:
should translate() and back_translate() be inverse functions of each other?

I would say that this is a desirable property; otherwise a nested
translate(back_translate(translate(...(seq)...))) is likely to end up as a
string or sequence of ambiguity codons, which is not very useful.  If the
round trip can't be guaranteed, then the opportunity to attempt it is
probably best avoided...

To ensure that translate() and back_translate() are inverse functions, the
backtranslation of a particular amino acid should return either a single
unambiguous codon, or an ambiguous codon that cannot be translated to an
alternative amino acid (assuming a consistent codon table throughout).  If
we do not arbitrarily choose an unambiguous codon, or a subset of all
possible codons, then we need a representation of the ambiguity that is
not yet present in the Seq object (e.g. for Ser, Leu or Arg as described
above).  translate() would also need modifying to spot and accept such
ambiguity.  This looks like harder work than it's worth.
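
As an illustration of the "ambiguous codon that cannot be translated to an alternative amino acid" option, here is a rough sketch in plain Python. It assumes the standard genetic code, and safe_codon(), back_translate() and translate() are hypothetical stand-ins written for this example, not the Biopython functions of the same name. For each amino acid it picks the most inclusive IUPAC codon whose every expansion still encodes that amino acid, so Ser, Leu and Arg each lose two of their six codons.

```python
from functools import lru_cache
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


@lru_cache(maxsize=None)
def expand(ambiguous_codon):
    """All unambiguous codons matched by an IUPAC codon."""
    return frozenset(
        "".join(c) for c in product(*(IUPAC[b] for b in ambiguous_codon))
    )


@lru_cache(maxsize=None)
def safe_codon(amino_acid):
    """Most inclusive IUPAC codon whose every expansion encodes amino_acid.

    Ties are broken alphabetically, which is an arbitrary choice.
    """
    candidates = [
        "".join(c) for c in product(IUPAC, repeat=3)
        if {CODON_TABLE[x] for x in expand("".join(c))} == {amino_acid}
    ]
    return min(candidates, key=lambda c: (-len(expand(c)), c))


def back_translate(protein):
    return "".join(safe_codon(aa) for aa in protein)


def translate(dna):
    """Translate, accepting an ambiguous codon only if it is unequivocal."""
    protein = []
    for i in range(0, len(dna), 3):
        encoded = {CODON_TABLE[c] for c in expand(dna[i:i + 3])}
        protein.append(encoded.pop() if len(encoded) == 1 else "X")
    return "".join(protein)


print(back_translate("MR"))             # ATGCGN
print(translate(back_translate("MR")))  # MR
```

The round trip holds here only because the two extra serine, leucine and arginine codons are deliberately thrown away, which is exactly the arbitrary subset mentioned above.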

>> It was something like this that I envisioned as a candidate for a Seq
>> method (based on the behaviour of the existing Bio.Translate
>> functionality), but only if such a simple back_translate
>> method/function had any real uses.  And thus far, I haven't seen any.
>>   
> For you perhaps but my reasons are very real to me!

I agree with Peter on this.  I don't see a single compelling use case for
back_translate() in a Seq object.

I can sort of see a potential use where, if you have a protein and want to
design a primer to the coding sequence (which is not known - otherwise there
are better ways to do this), then you might want to generate a sequence of
IUPAC ambiguity codes to guide primer design.  This might involve obtaining
a sequence only of the *certain* bases, e.g. Phe -> TTN; Ser -> NNN; Gly ->
GGN; Asp -> GAN, so that FSG -> TTNNNNGGN, and there are four of nine bases
around which primers might be designed.  However, I'm *really* stretching to
come up with this example.
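
A sketch of that "certain bases only" idea, for what it's worth. It assumes the standard genetic code, and certain_codon() and certain_bases() are names invented for this example. A base is kept only where every codon for the amino acid agrees at that position; everything else becomes N.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def certain_codon(amino_acid):
    """Keep a base only where all codons for amino_acid agree, else N."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    return "".join(
        column[0] if len(set(column)) == 1 else "N"
        for column in zip(*codons)
    )


def certain_bases(protein):
    """Concatenate the per-residue 'certain base' codons for a protein."""
    return "".join(certain_codon(aa) for aa in protein)


template = certain_bases("FSG")  # Phe-Ser-Gly
print(template, sum(base != "N" for base in template))  # TTNNNNGGN 4
```

The same function gives NTN for Leu and NGN for Arg, so the six-codon amino acids contribute almost nothing to such a template.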

I've outlined my views on some of the possible ways back_translate() might
work below:

Translate protein to its original coding sequence:
===================================================
Problem: this may be just guesswork in a (very) large sequence space
Potential solution: guesswork may be guided by codon usage tables or user
preference for codons, but the biological utility/significance of the
result, which is still guessed at, is highly questionable.
Alternatives: If the originating organism's sequence is known, then TBLASTN
is fast, works well, and avoids the problem.  Alternatively, forward
translation followed by a search for the protein sequence is quicker and
less messy.

Translate protein to a single possible coding sequence (not necessarily
original):
============================================================================
Problem: Same one each time, or choose randomly? What is the point, anyway?
See above for solutions/alternatives

Translate protein to ambiguous representation (inverse translate and/or
return Seq):
============================================================================
Problem: changes required to the way sequences are represented in Seq
objects; this is a significant change at the heart of Biopython with many
inevitable side-effects.  Not clear how this would work, yet.
Potential solution: major coding upheaval and rewriting of Biopython
Alternatives: ignore the requirement that backtranslation is the inverse of
translation; do not return a Seq object, but instead store the
backtranslation as an attribute, or just return a string for the user to do
what they want with

Translate protein to ambiguous representation (not inverse of translate, do
not return Seq):
============================================================================
Problem: what's the point?  Also, agreeing which ambiguous representation
to use: regex, IUPAC, or something else; IUPAC ambiguities aren't a
convenient representation for Ser, Leu, Arg.
Potential solution: just use a regex; allow a choice; make an executive
decision; ignore it and hope it goes away.

I think that the last behaviour here is the only one that is feasible, but I
still don't see much point in implementing it.  At least turning a protein
sequence into a regex of possible codons would be quick to code...
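
To show roughly how quick the regex version would be, here is a minimal sketch in plain Python. It assumes the standard genetic code, and back_translate_regex() is a made-up name, not an existing Biopython function. Each residue becomes a non-capturing group listing all of its codons.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def back_translate_regex(protein):
    """Return a regex string matching every coding sequence for protein."""
    groups = []
    for residue in protein:
        codons = sorted(c for c, aa in CODON_TABLE.items() if aa == residue)
        groups.append("(?:" + "|".join(codons) + ")")
    return "".join(groups)


pattern = back_translate_regex("MR")
print(pattern)  # (?:ATG)(?:AGA|AGG|CGA|CGC|CGG|CGT)
print(bool(re.fullmatch(pattern, "ATGAGA")))  # True: AGA encodes Arg
print(bool(re.fullmatch(pattern, "ATGAGC")))  # False: AGC encodes Ser
```

Unlike the IUPAC encodings, this loses nothing for Ser, Leu and Arg, but the result is a plain string rather than anything a Seq object would understand.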

>> There is
>> also the option of returning multiple simple strings or Seq objects
>> (either as a list or preferably a generator) giving all possible back
>> translations, 

Eek! (for the reasons you mention)
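
For a sense of scale, the number of possible back-translations is the product of the codon counts of each residue. A small sketch, assuming the standard genetic code and Python 3.8+ for math.prod; all_back_translations() and count_back_translations() are hypothetical names used only here.

```python
from itertools import product
from math import prod

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Amino acid -> sorted list of the codons that encode it.
CODONS_FOR = {}
for codon, aa in sorted(CODON_TABLE.items()):
    CODONS_FOR.setdefault(aa, []).append(codon)


def all_back_translations(protein):
    """Lazily yield every coding sequence for protein, one at a time."""
    for combo in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(combo)


def count_back_translations(protein):
    """Count the coding sequences without generating any of them."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


print(list(all_back_translations("MW")))  # ['ATGTGG']
print(count_back_translations("MR"))      # 6
print(count_back_translations("LSR"))     # 216
```

A generator keeps memory use flat, but three residues already give 216 sequences, and the count keeps multiplying with every further residue that has more than one codon.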

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
