[BioPython] Re: back-translation method for Seq object?

Peter biopython at maubp.freeserve.co.uk
Wed Oct 22 15:33:00 UTC 2008


Bruce wrote:
>>>> For completeness as these are not 100% correct,
>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

Just for the record, in addition to the debate about the final equals
signs above, there is at least one error in the above: for the
leucine codons, (TTN|CTR) should read (TTR|CTN). This doesn't
matter for the discussion at hand.

Bruce wrote:
> Leighton does show these are correct:
> (CGV | AGR) == MGV
> and MGV ==(CGV | AGR)

I don't think Leighton meant to say that. A set of 6 codons is NOT
equal to a set of 8 codons.  However, if we say "subset" or
"superset" here, things are probably fine (I haven't double checked
that the correct ambiguity codes are used here).

Similarly, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN) covers 6
unambiguous codons.
This is a subset of YTN = (TTC|TTA|TTG|TTT|CTC|CTA|CTG|CTT) which
covers 8 unambiguous codons.
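
The subset relationship can be checked mechanically. This is only a
minimal sketch in plain Python, not part of Biopython: expand() is a
hypothetical helper written for this illustration, and the dictionary
is the standard IUPAC nucleotide ambiguity table.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(codon):
    """Return the set of unambiguous codons matched by an ambiguous codon.

    Hypothetical helper for this example only.
    """
    return {"".join(p) for p in product(*(IUPAC[base] for base in codon))}


leu = expand("TTR") | expand("CTN")  # the six leucine codons
ytn = expand("YTN")                  # everything YTN matches

print(len(leu), len(ytn))   # 6 8
print(leu < ytn)            # True: a proper subset, not equal
print(sorted(ytn - leu))    # ['TTC', 'TTT']
```

The two extra codons are TTC and TTT. The same helper can be used on
the arginine codes quoted above, e.g. expand("MGV"), to double check
which codons they really cover.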

Having back_translate("L") == "YTN" means
translate(back_translate("L")) == "X", which would surprise many.
Using "YTN" covers all the leucine codons plus two extra ones (TTT and
TTC, which code for phenylalanine).  This might be useful for
searching purposes, but otherwise it's very misleading.

Having back_translate("L") == "CTN" means
translate(back_translate("L")) == "L", but doesn't cover the two
codons TTR (i.e. TTA or TTG).  At least this is better than
back_translate("L") == "TTR" which still has
translate(back_translate("L")) == "L", but doesn't cover the four
codons CTN.  Picking any one of the six codons also ensures
translate(back_translate("L")) == "L" but of course doesn't cover the
other five codons.  In all three cases, the utility of the back
translation is limited.
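
To make the round trip concrete, here is a rough sketch of how an
ambiguity-aware translation could treat each candidate. It is plain
Python rather than Biopython's own code; translate_codon() is a
hypothetical helper, and the codon table holds only the standard-table
entries needed for leucine and phenylalanine.

```python
from itertools import product

# Only the IUPAC codes used in this example.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

# Partial standard genetic code: just the Leu and Phe codons.
CODON_TABLE = {
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "TTT": "F", "TTC": "F",
}


def translate_codon(codon):
    """Translate one possibly ambiguous codon (hypothetical helper).

    Returns the amino acid if every matching unambiguous codon agrees,
    otherwise "X". Codons outside the partial table raise KeyError.
    """
    amino_acids = {
        CODON_TABLE["".join(p)]
        for p in product(*(IUPAC[base] for base in codon))
    }
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


for candidate in ("YTN", "CTN", "TTR", "CTG"):
    print(candidate, translate_codon(candidate))
# YTN X
# CTN L
# TTR L
# CTG L
```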

> Yes, I am also very aware that this creates a problem for doing a
> translate(back_translate(sequence)) without using a special translation
> table (yet another reason for not including it in Seq object or just return
> an exception).

Yes.

> As I pointed out in your other thread, I do not believe that a
> back-translation should be part of the Seq object.

In the absence of a compelling use case, I agree.

> If for no other reason
> than back-translation just creates too many ambiguous nucleotides in one DNA
> sequence. This will cause some of the algorithms to determine protein or DNA
> sequences to fail (back_translate('AFLFQPQRFGR') gives
> 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes
> NCBI's online BLASTN to say it is protein).

In such cases, you can explicitly tell BLAST (or other tools) whether
the input is nucleotide or protein.  However, this is a valid concern
when working with ambiguous nucleotides.

As an aside, from the Zen of Python: "In the face of ambiguity, refuse
the temptation to guess." (here, nucleotide versus protein)

> In any case, BLAST and such are not very good at handling
> multiple ambiguous nucleotides in a sequence when probably
> one-third to one-half of the sequence would be ambiguous
> nucleotides.

Ambiguous searches are bound to be tricky.
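
For the example sequence quoted earlier, the proportion is easy to
count. A quick sketch in plain Python, treating any letter other than
A, C, G or T as ambiguous:

```python
# Back-translated sequence quoted earlier in this thread.
seq = "GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV"

# Count letters that are not one of the four unambiguous bases.
ambiguous = sum(1 for base in seq if base not in "ACGT")
fraction = ambiguous / len(seq)

print(ambiguous, len(seq), round(fraction, 2))  # 14 33 0.42
```

That is 14 of 33 positions, which falls inside the one-third to
one-half range.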

Peter


