[BioPython] Re: back-translation method for Seq object?

Wed Oct 22 15:04:29 UTC 2008

Peter wrote:
> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>   
>> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
>>
>>     
>>> For completeness as these are not 100% correct,
>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>       
>
> I was going to jump up and down and disagree with you here Bruce, but
> Leighton has already made the same point, (CGV | AGR) != MGV etc.
> It is true that the ambiguous codon MGV would cover all the possible
> Arg codons, but it includes more than that.  While this could be a
> useful thing for certain back-translation reasons, it does break the
> expectation that translate(back_translate(sequence)) == sequence
> [currently the behaviour available in Bio.Translate].
>   
Leighton does show these are correct:
(CGV | AGR) == MGV
and MGV ==(CGV | AGR)

BUT I fully agree that MGV also stands for other codons that do not
translate to Arg, as Leighton pointed out. This was why I prefixed
this by stating "these are not 100% correct", so I am sorry that I was
not clear enough.  Yes, I am also very aware that this creates a problem
for doing a translate(back_translate(sequence)) without using a special
translation table (yet another reason for not including it in the Seq
object, or for just raising an exception).
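
To make the round-trip problem concrete, here is a minimal sketch in
plain Python. It is not the Bio.Translate code; the helper names
expand_codon and amino_acids_for are made up for illustration, and the
standard genetic code (NCBI table 1) is assumed. It expands an
IUPAC-ambiguous codon into the concrete codons it matches and lists
the amino acids those codons translate to:

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (assumed: NCBI table 1), codons in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_codon(ambiguous_codon):
    """Return every unambiguous codon matched by an IUPAC codon (hypothetical helper)."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in ambiguous_codon))]


def amino_acids_for(ambiguous_codon):
    """Return the sorted amino acids the ambiguous codon could translate to."""
    return sorted({CODON_TABLE[c] for c in expand_codon(ambiguous_codon)})


print(amino_acids_for("MGV"))  # ['R', 'S']
print(amino_acids_for("YTN"))  # ['F', 'L']
```

Whenever the result holds more than one amino acid, a plain
translate() of the back-translated codon cannot be guaranteed to give
back the original residue.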

As I pointed out in your other thread, I do not believe that
back-translation should be part of the Seq object, if for no other
reason than that back-translation just creates too many ambiguous
nucleotides in one DNA sequence. This will cause some of the algorithms
that decide whether a sequence is protein or DNA to fail
(back_translate('AFLFQPQRFGR') gives
'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes NCBI's online BLASTN
to say it is protein). In any case, BLAST and similar tools are not very
good at handling multiple ambiguous nucleotides in a sequence when
probably one-third to one-half of the sequence would be ambiguous
nucleotides.
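
As a rough check of that estimate, the short sketch below counts the
positions in the back-translated example above that are not plain
A, C, G or T. The function name ambiguous_fraction is made up for
illustration and is not part of Biopython:

```python
def ambiguous_fraction(dna):
    """Fraction of positions that are not one of A, C, G or T (hypothetical helper)."""
    if not dna:
        return 0.0
    ambiguous = sum(1 for base in dna.upper() if base not in "ACGT")
    return ambiguous / len(dna)


back_translated = "GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV"
print(f"{ambiguous_fraction(back_translated):.2f}")  # 0.42
```

For this 33-nucleotide sequence, 14 positions are ambiguity codes,
which falls inside the one-third to one-half range.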

Bruce