[BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object?

Bruce Southey bsouthey at gmail.com
Tue Oct 21 19:46:58 UTC 2008


Leighton Pritchard wrote:
> Hi Bruce,
>
> On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
>   
>> Leighton Pritchard wrote:
>>     
>>> I don't see this, I'm afraid.
>>>
>>> Each codon -> one amino acid : one-one mapping
>>> Arg -> set of 6 possible codons : one-many mapping
>>>   
>>>       
>> If you believed this then your answer below is incorrect. The genetic
>> code allow for 1 amino acid to map to a three nucleotides but not any
>> three nor any more or any less than three.
>>     
>
> I'm fine with this bit.  Each such set of three nucleotides is called a
> 'codon'.  Six such codons are able to code for an arginine, as you note:
>
>   
>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)
>>     
>
> This is a one -> six mapping.  That is, one input (arginine), is capable of
> being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
> AGA, or AGG).  
>
> but you contradict this with the comment:
>
>   
>> So to be clear there is a one
>> to one mapping between a codon and amino acid as well amino acid and a
>> codon. Therefore it is impossible for Arg to map to six possible codons
>>     
>
> I think that you're confusing the biological fact (only one codon actually
> encoded this amino acid) with the back-translation problem (in the absence
> of any other information, any one of six codons is equally likely to have
> encoded this amino acid).
>
> ---
>  
>   
>> This is still a one to one mapping between an amino acid and regular
>> expression relationship of the triplet that encodes it.
>>     
>
> Which is not the claim that I was making.  There are any number of ways of
> forcing a one-one mapping of this sort.  You could arguably represent it as
> a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
> that would not be informative in reconstructing the actual coding sequence
> (if that was what you wanted - which is the point of the discussion: what is
> the point of a back_translate() method?).  The regular expression mapping is
> not useful for this, either.
>
>   
>> You are not representing the one to six mapping you indicated above as
>> sequence is composed of 300 nucleotides not 1800 as must occur with a
>> one to 6 codon mapping [...]
>>     
>
> I think you've misunderstood what's going on here.
>
> Imagine a reduced system, where there is only one amino acid - let's call it
> A - and there are two possible codons that can produce this amino acid - XXX
> and YYY (thanks, Coldplay).
>
> Now, if we have a 'sequence' of only one amino acid: 'A', that might have
> been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
> that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
> are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
> the nucleotide sequence is 3*1=3 long.
>
> But if our sequence has two amino acids: 'AA', this could have been the
> result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
> one of four equally likely possibilities, and this is a 1->4 mapping (one
> sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
> 3*2 long.
>
> If we build longer sequences, we find that the number of potential outcomes
> is 2**n, where n is the number of 'A's in the input sequence, and the
> mapping is 1->2**n.  The nucleotide sequence is 3*n long.
>
> If we make this more general, where there are m codons for this amino acid,
> the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
> nucleotide sequence is, again, 3*n long.
>
> In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
> the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
> of encoding a sequence of 100 arginines.  A back_translate() method that
> pretends to find the 'correct' coding sequence in the absence of other
> information, rather than 'a' coding sequence, is not making a plausible
> claim.
>
>   
Thank you for agreeing with me! I am glad that you realized that the 
genetic code prevents a true one to many relationship. In say relational 
databases where you can have one table for the journal issue and one 
table for the papers in it, you can get multiple papers in a single 
issue. Likewise, if we ignore the genetic code, there is one amino acid 
and one or more codons. However, the genetic code means that you only 
can select one of all the codons possible resulting in multiple 
combinations of one to one relationships.
>> It is not my position to argue what a user wants or how stupid I think
>> that the request is. The user would quickly learn.
>>     
>
> While it is entirely possible to implement a function called
> back_translate() that does something a user doesn't want or need, I'm not
> sure that it's the approach that should be taken, here.
>
> It is your position to argue what you want or need out of a back_translate()
> method, and why, so that other people can see your point of view, and maybe
> be swayed by it.  I don't see a use for such a method, even to produce a
> regular expression for searching nucleotide sequences, because TBLASTN is so
> much more efficient.
>   
This very much depends on how you want to use it. TBLASTN is not very 
good for very short sequences and can not handle protein domains/motifs 
such as those in Prosite.

>   
>> Isoleucine and Leucine are the worst case (there are a couple of others
>> that are close) because these have the same mass so you have to search for:
>> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
>>
>> If you are searching say for an RFamide, you know that you need at least
>> RFG, which means you need to do a query using regular expression on the
>> plus strand using:
>> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
>>
>> You then try to extend the match to more amino acids until you reach the
>> desired mass (hopefully avoiding any introns) or sufficiently that you
>> can use some other tool to help.
>>     
>
> I think that, in your position, I'd compare timings with a six-frame,
> three-frame or forward translation of (depending on the nature of the
> nucleotide sequence) the nucleotide sequence you're searching against, and
> then use a regular expression or string search with the protein sequence as
> the query.  That's likely to be significantly faster than a regex search
> with that many groups, with the effects more noticeable at larger query
> sequence lengths; particularly so if you cache or save the translated
> sequences for future searches.
>   
Thanks for the comments as I did not think about reusing the translation.
>    
>   
>>>> Another case where it would be useful is that tools like TBLASTN gives
>>>> protein alignments so you must open the DNA sequence and find the DNA
>>>> region based on the protein alignment.
>>>>         
>>> You could use TBLASTN output - which provides start and stop coordinates for
>>> the match on the subject sequence - to extract this directly, without the
>>> need for backtranslation.
>>>       
>
>   
>> Exactly my point, where is the DNA sequence?
>>     
>
> It's in the database against which you queried; TBLASTN queries against
> nucleotide databases.  Wait, that's not quite right - 
No, it is not even correct! :-)

> TBLASTN translates
> nucleotide databases into protein databases and queries against them with
> the protein sequence, partly because of the one-many mapping of
> back-translation.
>   
Not exactly as stop codons are not in protein databases except where 
they code for an amino acid.

> If the database is local, you can use fastacmd (part of BLAST) to dump the
> entire database, to retrieve the single matching sequence from the database,
> or even to extract only the region of the sequence that is the match.  Try
> fastacmd --help at the command-line.
>
> If your database is not local, you can (probably) obtain the sequence by
> querying GenBank with the accession number.  If you can't do that, or ask
> the people who compiled the database you're querying against, or if they
> won't let you have the sequence, then you're stuck with guessing the coding
> sequence.
>
>   
>> Only if you have direct access to the DNA sequence can you get it.
>>     
>
> That's not true; fastacmd can extract FASTA-formatted sequences from any
> (version number compatibilities notwithstanding) correctly-formatted BLAST
> database.
>
>   
Obviously because you still have direct access to the DNA sequence.
>> Furthermore, the DNA sequence
>> must be exactly the same because any change in the coordinates screws it
>> up.
>>     
>
> I don't see how that is a great concern.  The coordinates of the match would
> come from the same database you were searching, so should match.  If your
> database is up-to-date, and you have to go to GenBank, then you should have
> the most recent revision of the sequence in there, anyway.
>
> Even if both of the above options fail, and you can acquire the new sequence
> by some accession identifier, you can build a new local database from that
> sequence alone, and find where the match is.  Or translate and search
> directly in Python.
>   

These were some of the things that one was trying to avoid, especially 
repeating it all over again and hoping like crazy that it is still 
present. (Genome assemblies are not very forgiving.)
> If you truly have no access to the DNA sequence (e.g. if it's proprietary
> information, you can't access the BLAST database, and no-one will send you
> the sequence) then, and only then, are you stuck with guessing the coding
> sequence in *very* large parameter space.
>
> Best,
>
> L.
>
>   
Bruce




More information about the Biopython mailing list