[BioPython] Re: back-translation method for Seq object?

Tue Oct 21 14:13:15 UTC 2008

Leighton Pritchard wrote:
> On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
>
>   
>> Leighton Pritchard wrote:
>>     
>>> This is the key problem.  Forward translation is - for a given codon table -
>>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>>> If the goal is to produce the coding sequence that actually encoded a
>>> particular protein sequence, the problem is combinatorial and rapidly
>>> becomes messy with increasing sequence length.
>>>   
>>>       
>> If you use a regular expression or a tree structure then there is a
>> one-one mapping but then that would probably be best as a subclass of Seq.
>>     
>
> I don't see this, I'm afraid.
>
> Each codon -> one amino acid : one-one mapping
> Arg -> set of 6 possible codons : one-many mapping
>   
If you believe this, then your answer below is incorrect. The genetic 
code allows one amino acid to map to exactly three nucleotides: not any 
three, and no more or fewer than three. So, to be clear, there is a 
one-to-one mapping between a codon and an amino acid, as well as between 
an amino acid and a codon. Therefore it is impossible for Arg to map to 
six possible codons, as only one is correct. Under the standard genetic 
code, each amino acid can be represented in a regular expression either 
as the bases or as ambiguous nucleotide codes:
Ala/A = (GCT|GCC|GCA|GCG) = GCN
Leu/L = (TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN)
Arg/R = (CGT|CGC|CGA|CGG|AGA|AGG) = (CGN|AGR)
Lys/K = (AAA|AAG) = AAR
Asn/N = (AAT|AAC) = AAY
Met/M = ATG = ATG
Asp/D = (GAT|GAC) = GAY
Phe/F = (TTT|TTC) = TTY
Cys/C = (TGT|TGC) = TGY
Pro/P = (CCT|CCC|CCA|CCG) = CCN
Gln/Q = (CAA|CAG) = CAR
Ser/S = (TCT|TCC|TCA|TCG|AGT|AGC) = (TCN|AGY)
Glu/E = (GAA|GAG) = GAR
Thr/T = (ACT|ACC|ACA|ACG) = ACN
Gly/G = (GGT|GGC|GGA|GGG) = GGN
Trp/W = TGG = TGG
His/H = (CAT|CAC) = CAY
Tyr/Y = (TAT|TAC) = TAY
Ile/I = (ATT|ATC|ATA) = ATH
Val/V = (GTT|GTC|GTA|GTG) = GTN

This is still a one-to-one mapping between an amino acid and the regular 
expression for the triplets that encode it. Unfortunately the ambiguous 
nucleotide codes cannot be used directly in a regular expression 
search.
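
One possible workaround, as a minimal sketch in plain Python: expand each 
ambiguity code into a regular expression character class before 
searching. The helper name iupac_to_regex is hypothetical (it is not a 
Biopython function); the code table is the standard IUPAC nucleotide 
alphabet.

```python
import re

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand ambiguity codes into character classes (hypothetical helper)."""
    parts = []
    for symbol in pattern.upper():
        if symbol in IUPAC:
            bases = IUPAC[symbol]
            # Unambiguous bases stay as they are; the rest become [..] classes.
            parts.append(bases if len(bases) == 1 else "[" + bases + "]")
        else:
            # Leave regex syntax such as "(", "|" and ")" untouched.
            parts.append(symbol)
    return "".join(parts)


# Ser/S from the table above: (TCN|AGY)
ser = iupac_to_regex("(TCN|AGY)")
print(ser)  # (TC[ACGT]|AG[CT])
print(bool(re.fullmatch(ser, "AGC")))  # True
```

The expanded pattern accepts the same six serine codons as the explicit 
alternation, so either form can be handed to re.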

> It doesn't matter how it's represented in code, the problem of a one-many
> mapping still exists for amino acid -> codon translation in most cases.
>
> The combinatorial nature of the overall problem can be illustrated by
> considering the unlikely case of a protein that comprises 100 arginines.
> The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
> choose any one of these to be your potential coding sequence doesn't negate
> the fact that there are still (6.5e77)-1 other possibilities... It doesn't
> get much better if you use the average number of codons per amino acid:
> 61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
> coding sequences.  I wouldn't want to guess which one was correct, and I
> can't see a back_translate method in this instance doing more than producing
> a nucleotide sequence that is potentially capable of producing the passed
> protein sequence, but for which no claims can be made about biological
> plausibility.
>   
You are not representing the one-to-six mapping you indicated above, as 
the sequence is composed of 300 nucleotides, not the 1800 that must occur 
with a one-to-six codon mapping. Rather, you have provided the number of 
combinations of the six codons that can give you 100 Args based on a 
one-to-one mapping of one codon to one Arg.  If you use ambiguous 
nucleotide codes, you can reduce it down to 1.267651e+30 potential coding 
sequences for 100 amino acids as a worst case scenario.
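
For reference, the three figures in this exchange can be reproduced with 
a few lines of Python. This is only the arithmetic; reading 1.267651e+30 
as 2**100 (at most two ambiguous triplets per amino acid, as for Leu, Arg 
and Ser) is my interpretation of where that number comes from.

```python
# Number of candidate coding sequences for a 100-residue protein.
length = 100

arg_only = 6 ** length    # every residue is Arg, six explicit codons each
average = 3 ** length     # about 61/20 ~= 3 codons per residue on average
ambiguous = 2 ** length   # at most two ambiguous triplets per residue

print(f"{arg_only:.1e}")   # 6.5e+77
print(f"{average:.1e}")    # 5.2e+47
print(f"{ambiguous:.6e}")  # 1.267651e+30
```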

It is not my position to argue about what a user wants or how stupid I 
think the request is. The user would quickly learn.
> Now, a back_translate() that takes a protein sequence alignment and, when
> passed the coding sequences for each component sequence, returns the
> corresponding alignment of the nucleotide sequences, makes sense to me.  But
> that's a discussion for Bio.Alignment objects...
>
>   
>> I would suggest tools like Wise2 and exonerate
>> (http://www.ebi.ac.uk/~guy/exonerate/) are a better solution to gene
>> structure problems than using a Seq object.
>>     
>
> I wouldn't suggest using a Seq object for this purpose, either... ;)
>
>   
>>> I agree - I can't think of an occasion where I might want to back-translate
>>> a protein in this way that wouldn't better be handled by other means.  Not
>>> that I'm the fount of all use-cases but, given the number of ways in which
>>> one *could* back-translate, perhaps it would be better not to pick/guess at
>>> any single one.
>>>   
>>>       
>> Apart from the academic aspect, my main use is searching for protein
>> motifs/domains, enzyme cleavage sites, finding very short combinations
>> of amino acids and binding sites (I do not do this but it is the same)
>> in DNA sequences especially genomic sequence. These are usually very
>> small and, thus, unsuitable for most tools.
>>     
>
> I do much the same, and haven't found a pressing use for back-translation,
> yet - YMMV.
>
>   
>> One of my uses is with
>> peptide identification and de novo sequencing using mass spectrometry
>> when you don't know the actual protein or gene sequence. It also has the
>> problem that certain amino acids have very similar mass so you would
>> need to  Regardless of whether you use a regular expression query or not
>> you still need a back translation of the protein query and probably the
>> reverse complement.
>>     
>
> Perhaps I'm being dense, but I don't see why that is.  Can you give an
> example?
>   
Isoleucine and Leucine are the worst case (there are a couple of others 
that are close) because these have the same mass so you have to search for:
(TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)

If you are searching, say, for an RFamide, you know that you need at 
least RFG, which means you need to query the plus strand using a regular 
expression:
(CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)

You then try to extend the match to more amino acids until you reach the 
desired mass (hopefully avoiding any introns) or far enough that you can 
use some other tool to help.
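
A minimal sketch of that search in plain Python, covering both strands. 
The function names (peptide_to_regex, reverse_complement, find_peptide) 
and the toy DNA string are made up for illustration, and the codon table 
only holds the three residues needed here.

```python
import re

# Standard genetic code, restricted to the residues used in this example.
CODONS = {
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}


def peptide_to_regex(peptide):
    """Back-translate a peptide into one alternation group per residue."""
    return "".join("(" + "|".join(CODONS[aa]) + ")" for aa in peptide)


def reverse_complement(dna):
    """Reverse complement of an unambiguous, upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_peptide(peptide, dna):
    """Return (strand, start) hits; start is 0-based on the strand searched."""
    pattern = re.compile(peptide_to_regex(peptide))
    hits = [("+", m.start()) for m in pattern.finditer(dna)]
    hits += [("-", m.start()) for m in pattern.finditer(reverse_complement(dna))]
    return hits


toy_dna = "CCAGATTCGGACC"  # hypothetical sequence: AGA TTC GGA encodes R F G
print(peptide_to_regex("RFG"))
print(find_peptide("RFG", toy_dna))  # [('+', 2)]
```

Note that finditer reports non-overlapping matches only, and that 
minus-strand positions are given relative to the reverse complement, so 
they need converting before being compared with plus-strand coordinates.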
>   
>> Another case where it would be useful is that tools like TBLASTN gives
>> protein alignments so you must open the DNA sequence and find the DNA
>> region based on the protein alignment.
>>     
>
> You could use TBLASTN output - which provides start and stop coordinates for
> the match on the subject sequence - to extract this directly, without the
> need for backtranslation.  Example output where subject coordinates give the
> match location below:
>
>   
> >ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome
>           Length = 5064019
>
>  Score =  731 bits (1887), Expect = 0.0
>  Identities = 363/376 (96%), Positives = 363/376 (96%)
>  Frame = +3
>
> Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 60
>               MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 477611
>
> [...]
>
> L.
>
>   
Exactly my point: where is the DNA sequence? You can only get it if you 
have direct access to the DNA sequence. Furthermore, the DNA sequence 
must be exactly the same because any change in the coordinates screws it 
up.
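
To illustrate, here is a rough sketch of coordinate-based extraction for 
a plus-strand hit, assuming the subject DNA is available as a plain 
string. The function extract_hit and the toy subject are hypothetical. 
In the TBLASTN example above, 477432 to 477611 spans 180 nucleotides, 
i.e. the 60 codons of the aligned segment.

```python
def extract_hit(subject_dna, start, end):
    """Slice a plus-strand hit given 1-based, inclusive coordinates."""
    if start < 1 or end > len(subject_dna) or start > end:
        raise ValueError("coordinates fall outside the sequence")
    region = subject_dna[start - 1:end]
    if len(region) % 3:
        raise ValueError("hit length is not a whole number of codons")
    return region


toy_subject = "GG" + "ATGTTCCAC" + "TT"  # hypothetical; ATG TTC CAC = M F H
hit = extract_hit(toy_subject, 3, 11)
print(hit)  # ATGTTCCAC

# The same coordinates on a sequence with one extra leading base
# return a different, frame-shifted region.
shifted = extract_hit("A" + toy_subject, 3, 11)
print(shifted)  # GATGTTCCA
```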

Bruce