[BioPython] Re: back-translation method for Seq object?

Mon Oct 20 09:38:10 UTC 2008

On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Leighton Pritchard wrote:
>> This is the key problem.  Forward translation is - for a given codon table -
>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>> If the goal is to produce the coding sequence that actually encoded a
>> particular protein sequence, the problem is combinatorial and rapidly
>> becomes messy with increasing sequence length.
>>   
> If you use a regular expression or a tree structure then there is a
> one-one mapping but then that would probably be best as a subclass of Seq.

I don't see this, I'm afraid.

Each codon -> one amino acid : one-one mapping
Arg -> set of 6 possible codons : one-many mapping

It doesn't matter how it's represented in code, the problem of a one-many
mapping still exists for amino acid -> codon translation in most cases.
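To make the point concrete, here is a minimal sketch of a regular-expression representation. It assumes the standard genetic code (NCBI table 1), and the names CODONS_FOR and back_translation_regex are hypothetical, not part of Biopython.  The regex is a single object, but every degenerate position still holds an alternation over all of that residue's codons:

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
# Assumption: the standard table; other tables change the mapping.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to the list of codons encoding it.
CODONS_FOR = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(bases))


def back_translation_regex(protein):
    """Return a regex string matching every DNA sequence encoding `protein`."""
    return "".join(
        "(?:" + "|".join(CODONS_FOR[aa]) + ")" for aa in protein.upper()
    )


if __name__ == "__main__":
    print(CODONS_FOR["R"])               # the six arginine codons
    print(back_translation_regex("MR"))  # one group per residue
```

The pattern for "MR" matches both ATGCGT and ATGAGA, so the one-many choice has been moved into the representation rather than resolved.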

The combinatorial nature of the overall problem can be illustrated by
considering the unlikely case of a protein that comprises 100 arginines.
The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
choose any one of these to be your potential coding sequence doesn't negate
the fact that there are still (6.5e77)-1 other possibilities... It doesn't
get much better if you use the average number of codons per amino acid:
61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
coding sequences.  I wouldn't want to guess which one was correct, and I
can't see a back_translate method in this instance doing more than producing
a nucleotide sequence that is potentially capable of producing the passed
protein sequence, but for which no claims can be made about biological
plausibility.
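The arithmetic above can be checked with a short sketch.  It assumes the standard genetic code; the DEGENERACY table and the function name count_back_translations are introduced here for illustration only:

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (assumption: NCBI table 1; the 61 sense codons are shared by 20 residues).
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_back_translations(protein):
    """Count the distinct coding sequences that translate to `protein`."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    # 100 arginines: 6**100 candidate coding sequences.
    print(f"{count_back_translations('R' * 100):.1e}")
    # Only Met and Trp have a single codon each.
    print(count_back_translations("MW"))
```

Only sequences built entirely from Met and Trp have a unique back-translation; every other residue multiplies the count.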

Now, a back_translate() that takes a protein sequence alignment and, when
passed the coding sequences for each component sequence, returns the
corresponding alignment of the nucleotide sequences, makes sense to me.  But
that's a discussion for Bio.Alignment objects...
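A minimal sketch of that idea, for a single row of the alignment.  The function name thread_cds_onto_alignment is hypothetical (it is not an existing Biopython call), and it assumes the coding sequence is in frame, has no introns, and excludes the stop codon:

```python
def thread_cds_onto_alignment(aligned_protein, cds, gap="-"):
    """Map a gapped protein row onto its known coding sequence.

    Each residue consumes the next codon of `cds`; each protein gap
    becomes a three-character gap in the nucleotide row.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError(
            "CDS length %d does not match %d residues" % (len(cds), n_residues)
        )
    # Walk the CDS codon by codon, in step with the ungapped residues.
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(
        gap * 3 if ch == gap else next(codons) for ch in aligned_protein
    )


if __name__ == "__main__":
    print(thread_cds_onto_alignment("M-R", "ATGCGT"))
```

No guessing is involved here: the codons come from the real coding sequence, and the protein alignment only decides where the gaps go.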

> I would suggest tools like Wise2 and exonerate
> (http://www.ebi.ac.uk/~guy/exonerate/) are a better solution to gene
> structure problems than using a Seq object.

I wouldn't suggest using a Seq object for this purpose, either... ;)

>> I agree - I can't think of an occasion where I might want to back-translate
>> a protein in this way that wouldn't better be handled by other means.  Not
>> that I'm the fount of all use-cases but, given the number of ways in which
>> one *could* back-translate, perhaps it would be better not to pick/guess at
>> any single one.
>>   
> Apart from the academic aspect, my main use is searching for protein
> motifs/domains, enzyme cleavage sites, finding very short combinations
> of amino acids and binding sites (I do not do this but it is the same)
> in DNA sequences especially genomic sequence. These are usually very
> small and, thus, unsuitable for most tools.

I do much the same, and haven't found a pressing use for back-translation,
yet - YMMV.

> One of my uses is with
> peptide identification and de novo sequencing using mass spectrometry
> when you don't know the actual protein or gene sequence. It also has the
> problem that certain amino acids have very similar mass so you would
> need to  Regardless of whether you use a regular expression query or not
> you still need a back translation of the protein query and probably the
> reverse complement.

Perhaps I'm being dense, but I don't see why that is.  Can you give an
example?

> Another case where it would be useful is that tools like TBLASTN gives
> protein alignments so you must open the DNA sequence and find the DNA
> region based on the protein alignment.

You could use TBLASTN output - which provides start and stop coordinates for
the match on the subject sequence - to extract this directly, without the
need for backtranslation.  Example output where subject coordinates give the
match location below:

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome
          Length = 5064019

 Score =  731 bits (1887), Expect = 0.0
 Identities = 363/376 (96%), Positives = 363/376 (96%)
 Frame = +3

Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 60
              MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 477611

[...]
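As a sketch of the extraction step, assuming the subject sequence is already loaded as a plain string and that the Sbjct coordinates are 1-based and inclusive, with start greater than end for hits in negative frames.  The function name extract_subject_region is hypothetical:

```python
# Translation table for complementing unambiguous DNA (either case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_subject_region(subject_seq, sbjct_start, sbjct_end):
    """Return the nucleotides covered by a TBLASTN hit.

    Coordinates are 1-based and inclusive.  If start > end the hit is
    assumed to lie on the reverse strand, and the reverse complement
    of the region is returned.
    """
    low, high = sorted((sbjct_start, sbjct_end))
    region = subject_seq[low - 1:high]
    if sbjct_start > sbjct_end:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


if __name__ == "__main__":
    toy_genome = "AAATGCGTAA"  # toy data, not the genome in the hit above
    print(extract_subject_region(toy_genome, 3, 8))
    print(extract_subject_region(toy_genome, 8, 3))
```

For the line of output shown above, the span 477432 to 477611 is 180 nucleotides, which is exactly the 60 aligned residues times three.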

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
