[BioPython] back-translation method for Seq object?

Fri Oct 17 08:24:43 UTC 2008

On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> Quoting from the recent thread about adding a translation method to
> the Seq object, Bruce brought up back-translation:
> 
> Peter wrote:
>> Bruce wrote:
>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>> complex if there are many solutions.

This is the key problem.  Forward translation is - for a given codon table -
a many-one mapping: each codon gives exactly one amino acid.  Reverse
translation is (for many amino acids) one-many.
If the goal is to produce the coding sequence that actually encoded a
particular protein sequence, the problem is combinatorial and rapidly
becomes messy with increasing sequence length.  And that's not considering
the problem of splice variants/intron-exon boundaries if attempting to
relate the sequence back to some genome or genome fragment - more a problem
in eukaryotes.
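
To put a number on "combinatorial": a minimal sketch, assuming the standard
genetic code (NCBI table 1) and plain Python rather than any Biopython API.
The function names are hypothetical, made up for illustration only.

```python
from itertools import product
from math import prod  # Python 3.8+

# Standard genetic code (NCBI table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def degeneracy(table=CODON_TABLE):
    """Count how many codons encode each amino acid ('*' is stop)."""
    counts = {}
    for aa in table.values():
        counts[aa] = counts.get(aa, 0) + 1
    return counts


def count_back_translations(protein, table=CODON_TABLE):
    """Number of distinct coding sequences that translate to `protein`."""
    counts = degeneracy(table)
    return prod(counts[aa] for aa in protein)


print(count_back_translations("MW"))   # 1 * 1 = 1
print(count_back_translations("MLS"))  # 1 * 6 * 6 = 36
```

The count is the product of the per-residue codon degeneracies, so it grows
exponentially with length for any protein that is not all Met and Trp.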

>> Yes, back-translation is tricky because there is generally more than
>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>> describe several possible codons giving that amino acid, but in
>> general it is not possible to do this and describe all the possible
>> codons which could have been used.  This topic is worthy of an entire
>> thread... for the record, I would envisage a back_translate method for
>> the Seq object (assuming we settle on translate as the name for the
>> forward translation from nucleotide to protein).
> 
> Do we actually need a back_translate method?  Can anyone suggest an
> actual use-case for this?  It seems difficult to imagine that any
> simple version would please everyone.

I agree - I can't think of an occasion where I might want to back-translate
a protein in this way that wouldn't better be handled by other means.  Not
that I'm the fount of all use-cases but, given the number of ways in which
one *could* back-translate, perhaps it would be better not to pick/guess at
any single one.

Some choices to be made in deciding how to back-translate are (and I'm sure
you've already thought of them, but they're worth writing down):

I) Protein to unambiguous RNA:
  a) Codon table: arbitrary; organism-specific; user-defined?
  b) Codon choice: arbitrary and random; arbitrary and consistent; complete
     set of possibilities; most common codon (if information available);
     other favoured codon (if specified)?
II) Protein to ambiguous RNA:
  a) Return a Seq, string or some other representation of ambiguity?
  b) IUPAC ambiguity symbols; choice of codons; alternative representation
     of ambiguity?

The most common back-translation I do is taking aligned protein sequences
back to their known coding sequences, and this is really a case of mapping
known codons onto predefined positions, rather than the interpolation of
unknown codons that is required for back-translation as implied above.
T-coffee handles this pretty well, IIRC.
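
The codon-mapping case is simple enough to sketch.  This is only an
illustration in plain Python, not how T-coffee does it: thread_codons is a
hypothetical name, and it assumes the coding sequence is already known, is in
frame, and has no trailing stop codon.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map a known coding sequence onto one gapped row of a protein alignment.

    Each residue consumes the next codon from `cds`; each gap column
    becomes three gap characters.  The codons are NOT checked against
    the residues, so the caller must supply the matching CDS.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be 3 x the number of ungapped residues")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_codons("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

No codon is ever guessed here, which is why this is a different problem from
back-translation proper.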

To find coding sequences for a particular protein in the originating
sequence (if known), I use BLAST.  I guess there might be value in having
the ability to identify regions of the coding sequence that are least likely
to be variable (by generating them combinatorially) so that probes might be
designed if the coding sequence is not known.  But that doesn't appear to be
the way that most sequences are obtained these days: much cheaper to bung
RNA through 454 or Solexa and work through the output than to put someone on
the task of making an array of probes to find a sequence that may or may not
encode your sequenced protein...

> Bio.Translate (a semi-obsolete module whose deprecation has been
> suggested) provides a back_translate method which picks an essentially
> arbitrary but unambiguous codon for each amino acid.  Crude but
> simple.  A more meaningful choice would require supplying codon
> frequencies for the organism under consideration.

These can be found - for many organisms - in Emboss codon usage table (.cut)
files, if you have Emboss locally.  However, is requiring Emboss as a
dependency the cleanest or wisest solution for Biopython?  This approach
solves only one problem:  given a particular codon usage table, what is the
most likely sequence that would have produced this protein?  That's not a
problem I've ever come across in anger, but given a table of 'most efficient
codons' for some biological expression system, I can see this potentially
having some use.  However, given that many microbiologists can already tell
you the preferred codons for K12 without pausing for breath, I'm not sure
there's a problem looking for this solution.
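
For what it's worth, the "most common codon" option needs nothing beyond a
codon-to-count mapping, however that was obtained.  A rough sketch, assuming
the standard genetic code; the function names are hypothetical and the counts
below are made up purely for illustration, not real usage data for any
organism.

```python
from itertools import product

# Standard genetic code (NCBI table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def preferred_codons(usage, table=CODON_TABLE):
    """Pick, for each amino acid, the codon with the highest usage count.

    Codons missing from `usage` count as zero; ties go to whichever
    codon comes first in table order, which is an arbitrary choice.
    """
    best = {}
    for codon, aa in table.items():
        count = usage.get(codon, 0)
        if aa not in best or count > best[aa][1]:
            best[aa] = (codon, count)
    return {aa: codon for aa, (codon, _) in best.items()}


def back_translate_preferred(protein, usage, table=CODON_TABLE):
    """Back-translate using only the preferred codon for each residue."""
    choice = preferred_codons(usage, table)
    return "".join(choice[aa] for aa in protein)


# Made-up counts for illustration only; NOT real codon usage data.
toy_usage = {"CTG": 50, "CTT": 10, "AAA": 30, "AAG": 10, "ATG": 20}
print(back_translate_preferred("MKL", toy_usage))  # ATGAAACTG
```

Taking a plain dictionary keeps the source of the frequencies (an Emboss .cut
file or anything else) out of the function, so no extra dependency is implied.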

> Other possibilities include using ambiguous nucleotides to try and
> cover all the possibilities (e.g. "L" -> "CTN"), but even here in some
> cases this is arbitrary.  e.g. The standard three stop codons ['TAA',
> 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG']
> but not by a single ambiguous codon ('TRR' also covers 'TGG' which
> codes for 'W').
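
The stop codon example is easy to check by expanding each ambiguous codon and
translating the results.  A small sketch, assuming the standard genetic code
and the usual IUPAC nucleotide codes; expand and encoded are hypothetical
helper names.

```python
from itertools import product

# Standard genetic code (NCBI table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(ambiguous_codon):
    """List every unambiguous codon matched by an IUPAC codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in ambiguous_codon))]


def encoded(ambiguous_codon, table=CODON_TABLE):
    """Set of amino acids (or '*') covered by an ambiguous codon."""
    return {table[c] for c in expand(ambiguous_codon)}


for codon in ("TAR", "TRA", "TRR", "CTN"):
    print(codon, expand(codon), sorted(encoded(codon)))
```

Here 'TAR' and 'TRA' each cover only stop codons, while 'TRR' expands to TAA,
TAG, TGA and TGG and so also covers 'W'.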

If Seq had an ambiguity-aware sequence representation, this could be
handled.  For example, a regular expression-based sequence representation
(which could lie alongside Seq.data, perhaps as Seq.regex) could represent
these variants as (TAA|TAG|TGA), and alternatively the usual ambiguity codes
could also be handled in a similar way (e.g. R as [AG]).  This would be of
some limited use, but would permit sequence searching within Biopython, at
least.
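
As a sketch of the regex idea, using only Python's re module (Seq.regex does
not exist; protein_to_regex and iupac_to_regex are hypothetical names, and
the standard genetic code is assumed):

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def protein_to_regex(protein, table=CODON_TABLE):
    """Regex matching every DNA sequence that translates to `protein`."""
    by_aa = {}
    for codon, aa in table.items():
        by_aa.setdefault(aa, []).append(codon)
    return "".join("(?:" + "|".join(sorted(by_aa[aa])) + ")" for aa in protein)


def iupac_to_regex(ambiguous_dna):
    """Regex for an IUPAC-coded DNA string, e.g. R becomes [AG]."""
    parts = []
    for base in ambiguous_dna:
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return "".join(parts)


pattern = protein_to_regex("MW*")
print(pattern)                        # (?:ATG)(?:TGG)(?:TAA|TAG|TGA)
match = re.search(pattern, "CCATGTGGTAGCC")
print(match.start(), match.group())   # 2 ATGTGGTAG
print(iupac_to_regex("TAR"))          # TA[AG]
```

The search is not frame-aware: it reports a match wherever the pattern fits,
so any reading-frame constraint would have to be applied separately.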

> Potentially of use would be a generator function which returned all
> possible back translations - but this would be complex and typically
> overkill.

I think that, for large sequences, this could quickly swamp the user.  What
do you see as the use of this output?
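
The generator itself would not be hard to write; the volume of output is the
issue.  A minimal sketch with itertools.product, assuming the standard
genetic code (all_back_translations is a hypothetical name):

```python
from itertools import islice, product

# Standard genetic code (NCBI table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def all_back_translations(protein, table=CODON_TABLE):
    """Lazily yield every DNA sequence that translates to `protein`."""
    by_aa = {}
    for codon, aa in table.items():
        by_aa.setdefault(aa, []).append(codon)
    # One list of candidate codons per residue; product() walks all combinations.
    for codons in product(*(by_aa[aa] for aa in protein)):
        yield "".join(codons)


print(list(all_back_translations("MF")))  # ['ATGTTT', 'ATGTTC']

# Ten leucines already give 6**10 sequences, so only peek at the first few.
print(list(islice(all_back_translations("L" * 10), 3)))
```

Being a generator, it never holds the full set in memory, but anything that
consumes all of it still has to cope with an exponential number of sequences.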

> As a final point, a Seq object back-translation method could give RNA
> or DNA.  From a biological point of view giving DNA by default would
> make sense.  This choice is handled in Bio.Translate when creating the
> translator object (part of what makes Bio.Translate relatively complex
> to use).

Since there is a one-one map of RNA to DNA, I'm easy about either choice on
a computational level.  Biologically-speaking, DNA -> RNA is transcription,
and RNA -> protein is translation, so I'd expect back-translation to convert
protein -> RNA, and back-transcription to convert RNA -> DNA.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
