[BioPython] back-translation method for Seq object?

Fri Oct 17 18:46:19 UTC 2008

Leighton Pritchard wrote:
> On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:
>
>   
>> Quoting from the recent thread about adding a translation method to
>> the Seq object, Bruce brought up back-translation:
>>
>> Peter wrote:
>>     
>>> Bruce wrote:
>>>       
>>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>>> complex if there are many solutions.
>>>>         
>
> This is the key problem.  Forward translation is - for a given codon table -
> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
> If the goal is to produce the coding sequence that actually encoded a
> particular protein sequence, the problem is combinatorial and rapidly
> becomes messy with increasing sequence length.  And that's not considering
> the problem of splice variants/intron-exon boundaries if attempting to
> relate the sequence back to some genome or genome fragment - more a problem
> in eukaryotes.
>   
If you use a regular expression or a tree structure then there is a
one-one mapping, but that would probably be best as a subclass of Seq.
Note that you would still need a method to traverse it if you wanted to
get a sequence from it, as well as a reverse complement. It is fairly
trivial to get a regular expression for the standard genetic code, but
I did not get my reverse complement to work satisfactorily, nor did I
try to get a DNA sequence from the regular expression.
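
As an illustration only (this is not the code referred to above, and
none of these names exist in Biopython), a minimal sketch of a regular
expression back-translation for the standard genetic code might look
like the following. The function names and the one-group-per-residue
layout are assumptions. The reverse-strand pattern is built by reversing
the residue order and reverse-complementing each codon, which is one
possible way around the reverse complement problem.

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1); codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid ("*" for stop) to the list of codons that encode it.
BACK_TABLE = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    BACK_TABLE.setdefault(aa, []).append("".join(codon))

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an unambiguous, upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def back_translate_regex(protein):
    """Forward-strand pattern: one alternation group per residue."""
    return "".join("(?:%s)" % "|".join(BACK_TABLE[aa]) for aa in protein)


def back_translate_regex_rc(protein):
    """Reverse-strand pattern: residues reversed, each codon reverse-complemented."""
    return "".join(
        "(?:%s)" % "|".join(reverse_complement(c) for c in BACK_TABLE[aa])
        for aa in reversed(protein)
    )


# Usage: search a forward-strand DNA string for any coding of "MW".
match = re.search(back_translate_regex("MW"), "CCATGTGGAA")
```

The pattern grows linearly with the peptide length, so the
combinatorial blow-up stays inside the regular expression engine rather
than in an explicit list of candidate sequences.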

I would suggest that tools like Wise2 and exonerate
(http://www.ebi.ac.uk/~guy/exonerate/) are a better solution to gene
structure problems than a Seq object.

Obviously if you start with a DNA sequence, then you could create an
object that has a DNA/RNA Seq object and one or more protein Seq objects
that contain the translation(s), like GenBank DNA records that include
the translation. But that really avoids the issue here.

>>> Yes, back-translation is tricky because there is generally more than
>>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>>> describe several possible codons giving that amino acid, but in
>>> general it is not possible to do this and describe all the possible
>>> codons which could have been used.  This topic is worthy of an entire
>>> thread... for the record, I would envisage a back_translate method for
>>> the Seq object (assuming we settle on translate as the name for the
>>> forward translation from nucleotide to protein).
>>>       
>> Do we actually need a back_translate method?  Can anyone suggest an
>> actual use-case for this?  It seems difficult to imagine that any
>> simple version would please everyone.
>>     
>
> I agree - I can't think of an occasion where I might want to back-translate
> a protein in this way that wouldn't better be handled by other means.  Not
> that I'm the fount of all use-cases but, given the number of ways in which
> one *could* back-translate, perhaps it would be better not to pick/guess at
> any single one.
>   
Apart from the academic aspect, my main use is searching for protein
motifs/domains, enzyme cleavage sites, very short combinations
of amino acids, and binding sites (I do not do this but it is the same)
in DNA sequences, especially genomic sequence. These are usually very
small and, thus, unsuitable for most tools. One of my uses is with
peptide identification and de novo sequencing using mass spectrometry 
when you don't know the actual protein or gene sequence. It also has the
problem that certain amino acids have very similar mass, so you would
need to allow for those alternatives in the query. Regardless of whether
you use a regular expression query or not, you still need a back
translation of the protein query and probably the reverse complement.
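
To make the mass spectrometry case concrete, here is a rough sketch of
a short peptide query in which some positions allow more than one
residue (leucine and isoleucine have the same mass, and lysine and
glutamine are very close). The input format, a list of strings naming
the residues allowed at each position, and the function names are
assumptions made for this example. Both strands are covered by
searching the sequence and its reverse complement with the same
pattern.

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1); codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def peptide_to_dna_regex(positions):
    """positions: list of strings, each naming the residues allowed at that position."""
    groups = []
    for allowed in positions:
        codons = [c for aa in allowed for c in CODONS[aa]]
        groups.append("(?:%s)" % "|".join(codons))
    # Wrap in a lookahead so that overlapping hits are all reported.
    return re.compile("(?=(%s))" % "".join(groups))


def find_peptide(dna, positions):
    """Return (strand, start, matched DNA); start is 0-based on the strand searched."""
    pattern = peptide_to_dna_regex(positions)
    rc = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    hits = []
    for strand, seq in (("+", dna), ("-", rc)):
        for m in pattern.finditer(seq):
            hits.append((strand, m.start(), m.group(1)))
    return hits


# Usage: M, then I or L, then K or Q.
hits = find_peptide("CCATGATAAAGCC", ["M", "IL", "KQ"])
```

Note that the start reported for a "-" hit is a position on the reverse
complement, so it still has to be converted if forward-strand
coordinates are wanted.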

Another case where it would be useful is that tools like TBLASTN give
protein alignments, so you must open the DNA sequence and find the DNA
region based on the protein alignment.
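
The bookkeeping involved can be sketched as below, assuming you have a
residue range within one translated reading frame. The frame numbering
used here (1 to 3 on the forward strand, -1 to -3 on the reverse strand
counted from the far end of the sequence) and both function names are
assumptions for this example, so check them against what your tool
actually reports.

```python
def protein_range_to_dna(aa_start, aa_end, frame, dna_length):
    """Map a 1-based inclusive residue range in a translated frame to a
    0-based half-open (start, end) slice on the forward strand."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    offset = abs(frame) - 1
    start = offset + 3 * (aa_start - 1)
    end = offset + 3 * aa_end
    if aa_start < 1 or aa_end < aa_start or end > dna_length:
        raise ValueError("residue range does not fit in this frame")
    if frame < 0:
        # Coordinates so far are on the reverse complement; flip them.
        start, end = dna_length - end, dna_length - start
    return start, end


def extract_coding_dna(dna, aa_start, aa_end, frame):
    """Return the DNA encoding the residue range, read 5' to 3' on the coding strand."""
    start, end = protein_range_to_dna(aa_start, aa_end, frame, len(dna))
    region = dna[start:end]
    if frame < 0:
        region = region.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return region


# Usage: residues 1-2 of frame 2 in a 10 bp sequence.
region = extract_coding_dna("CATGTGGAAT", 1, 2, 2)
```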

Bruce