[BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object?

Tue Oct 21 15:29:35 UTC 2008

Hi Bruce,

On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Leighton Pritchard wrote:
>> I don't see this, I'm afraid.
>> 
>> Each codon -> one amino acid : one-one mapping
>> Arg -> set of 6 possible codons : one-many mapping
>>   
> If you believed this then your answer below is incorrect. The genetic
> code allow for 1 amino acid to map to a three nucleotides but not any
> three nor any more or any less than three.

I'm fine with this bit.  Each such set of three nucleotides is called a
'codon'.  Six such codons are able to code for an arginine, as you note:

> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)

This is a one -> six mapping.  That is, one input (arginine), is capable of
being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
AGA, or AGG).  

but you contradict this with the comment:

> So to be clear there is a one
> to one mapping between a codon and amino acid as well amino acid and a
> codon. Therefore it is impossible for Arg to map to six possible codons

I think that you're confusing the biological fact (only one codon actually
encoded this amino acid) with the back-translation problem (in the absence
of any other information, any one of six codons is equally likely to have
encoded this amino acid).

---

> This is still a one to one mapping between an amino acid and regular
> expression relationship of the triplet that encodes it.

Which is not the claim that I was making.  There are any number of ways of
forcing a one-one mapping of this sort.  You could arguably represent it as
a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
that would not be informative in reconstructing the actual coding sequence
(if that was what you wanted - which is the point of the discussion: what is
the point of a back_translate() method?).  The regular expression mapping is
not useful for this, either.

> You are not representing the one to six mapping you indicated above as
> sequence is composed of 300 nucleotides not 1800 as must occur with a
> one to 6 codon mapping [...]

I think you've misunderstood what's going on here.

Imagine a reduced system, where there is only one amino acid - let's call it
A - and there are two possible codons that can produce this amino acid - XXX
and YYY (thanks, Coldplay).

Now, if we have a 'sequence' of only one amino acid: 'A', that might have
been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
the nucleotide sequence is 3*1=3 long.

But if our sequence has two amino acids: 'AA', this could have been the
result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
one of four equally likely possibilities, and this is a 1->4 mapping (one
sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
3*2 long.

If we build longer sequences, we find that the number of potential outcomes
is 2**n, where n is the number of 'A's in the input sequence, and the
mapping is 1->2**n.  The nucleotide sequence is 3*n long.

If we make this more general, where there are m codons for this amino acid,
the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
nucleotide sequence is, again, 3*n long.

In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
of encoding a sequence of 100 arginines.  A back_translate() method that
pretends to find the 'correct' coding sequence in the absence of other
information, rather than 'a' coding sequence, is not making a plausible
claim.

> It is not my position to argue what a user wants or how stupid I think
> that the request is. The user would quickly learn.

While it is entirely possible to implement a function called
back_translate() that does something a user doesn't want or need, I'm not
sure that it's the approach that should be taken, here.

It is your position to argue what you want or need out of a back_translate()
method, and why, so that other people can see your point of view, and maybe
be swayed by it.  I don't see a use for such a method, even to produce a
regular expression for searching nucleotide sequences, because TBLASTN is so
much more efficient.

> Isoleucine and Leucine are the worst case (there are a couple of others
> that are close) because these have the same mass so you have to search for:
> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
> 
> If you are searching say for an RFamide, you know that you need at least
> RFG, which means you need to do a query using regular expression on the
> plus strand using:
> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
> 
> You then try to extend the match to more amino acids until you reach the
> desired mass (hopefully avoiding any introns) or sufficiently that you
> can use some other tool to help.

I think that, in your position, I'd compare timings with a six-frame,
three-frame or forward translation of (depending on the nature of the
nucleotide sequence) the nucleotide sequence you're searching against, and
then use a regular expression or string search with the protein sequence as
the query.  That's likely to be significantly faster than a regex search
with that many groups, with the effects more noticeable at larger query
sequence lengths; particularly so if you cache or save the translated
sequences for future searches.

>>> Another case where it would be useful is that tools like TBLASTN gives
>>> protein alignments so you must open the DNA sequence and find the DNA
>>> region based on the protein alignment.
>> You could use TBLASTN output - which provides start and stop coordinates for
>> the match on the subject sequence - to extract this directly, without the
>> need for backtranslation.

> Exactly my point, where is the DNA sequence?

It's in the database against which you queried; TBLASTN queries against
nucleotide databases.  Wait, that's not quite right - TBLASTN translates
nucleotide databases into protein databases and queries against them with
the protein sequence, partly because of the one-many mapping of
back-translation.

If the database is local, you can use fastacmd (part of BLAST) to dump the
entire database, to retrieve the single matching sequence from the database,
or even to extract only the region of the sequence that is the match.  Try
fastacmd --help at the command-line.

If your database is not local, you can (probably) obtain the sequence by
querying GenBank with the accession number.  If you can't do that, or ask
the people who compiled the database you're querying against, or if they
won't let you have the sequence, then you're stuck with guessing the coding
sequence.

> Only if you have direct access to the DNA sequence can you get it.

That's not true; fastacmd can extract FASTA-formatted sequences from any
(version number compatibilities notwithstanding) correctly-formatted BLAST
database.

> Furthermore, the DNA sequence
> must be exactly the same because any change in the coordinates screws it
> up.

I don't see how that is a great concern.  The coordinates of the match would
come from the same database you were searching, so should match.  If your
database is up-to-date, and you have to go to GenBank, then you should have
the most recent revision of the sequence in there, anyway.

Even if both of the above options fail, and you can acquire the new sequence
by some accession identifier, you can build a new local database from that
sequence alone, and find where the match is.  Or translate and search
directly in Python.

If you truly have no access to the DNA sequence (e.g. if it's proprietary
information, you can't access the BLAST database, and no-one will send you
the sequence) then, and only then, are you stuck with guessing the coding
sequence in *very* large parameter space.

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________