[BioPython] Re: back-translation method for Seq object?

Wed Oct 22 10:03:32 UTC 2008

On 21/10/2008 20:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Thank you for agreeing with me! I am glad that you realized that the
> genetic code prevents a true one to many relationship.

Bruce, I am not agreeing with you.  I'll try to clarify it another way:

More than one codon can encode the amino acid arginine (this is a many-one
relationship). The amino acid arginine can be 'decoded' to more than one
codon (this is a one-many relationship).

Imagine a function that accepts an amino acid as input and returns a valid
codon that could encode for the input amino acid.  This is 'decoding' as
described above, and is the process of back-translation for a single amino
acid.

For a single (i.e. 'one') amino acid, arginine, as input, the function might
correctly provide up to six (i.e. 'many') different valid answers.  This
makes it a one-many problem.  Further external constraints (e.g. codon
tables) may be applied to restrict the number of candidate codons, or to
weight the likelihood of each being correct in specific cases, but the
fundamental problem is one-many.

Providing arginine as input to a particular coded version of this function
might in all cases only return a single codon as output (one-one), but the
problem itself is still one-many.
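
A minimal sketch of the function imagined above, under stated assumptions: the
standard genetic code (NCBI translation table 1) is hard-coded here rather than
taken from Biopython, and the names back_translate_all, back_translate_one and
count_back_translations are hypothetical, chosen only for illustration.  The
"first codon alphabetically" rule is an arbitrary stand-in for whatever
external constraint a real implementation might apply.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written in TCAG order.
# Assumption: hard-coded for self-containment, not read from Biopython.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the many-one codon -> amino acid map into a one-many map.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def back_translate_all(amino_acid):
    """Return every codon that encodes the one-letter amino acid (one-many)."""
    return sorted(AA_TO_CODONS[amino_acid])


def back_translate_one(amino_acid):
    """Return a single codon by an arbitrary rule (a one-one implementation).

    Hypothetical rule: pick the first codon alphabetically.  The underlying
    problem is still one-many; only this implementation is one-one.
    """
    return back_translate_all(amino_acid)[0]


def count_back_translations(peptide):
    """Count the distinct nucleotide sequences that could encode the peptide."""
    total = 1
    for amino_acid in peptide:
        total *= len(AA_TO_CODONS[amino_acid])
    return total


print(back_translate_all("R"))         # ['AGA', 'AGG', 'CGA', 'CGC', 'CGG', 'CGT']
print(back_translate_one("R"))         # AGA
print(count_back_translations("MRL"))  # 36
```

Even a three-residue peptide such as MRL has 1 x 6 x 6 = 36 possible encodings,
so the number of candidates grows quickly with peptide length.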

Furthermore, even though only one codon was responsible -
biologically-speaking - for encoding the arginine you're submitting to the
function (one-one), your question is the inverse: effectively 'what codon
encoded this arginine?'.  But (and it's a big but), if you don't know
beforehand what that codon is (and why else would you bother using the
function?), the problem is one-many, as any of the six solutions might be
correct.

Analogously, there are two possible values for the square root of a positive
real number, such as 4.  It is inherently a one-many problem.  For 4, the
return value could, correctly, be +2 or -2.  Now, the math.sqrt() function
in Python follows mathematical convention for the radical, and only returns
the positive value, but that does not make the relationship between the
value and its square root one-one.  It only makes that implementation of the
function one-one; the answer could still, correctly, be either positive or
negative.
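
To make the square-root analogy concrete, here is a small sketch.  The helper
all_square_roots is hypothetical (it is not part of the math module); it only
illustrates the difference between the one-many problem and a one-one
implementation of it.

```python
import math


def all_square_roots(value):
    """Return every real square root of a non-negative real number."""
    if value < 0:
        raise ValueError("no real square roots for a negative number")
    root = math.sqrt(value)  # principal (non-negative) root only
    return [root] if root == 0 else [root, -root]


print(math.sqrt(4))         # 2.0
print(all_square_roots(4))  # [2.0, -2.0]
```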

Now, if your problem is: what is the side length of a farmer's square
field with area four square miles (big field!), only one of these answers
makes sense (one-one), as the field is constrained by our reality and cannot
have negative length (this is effectively equivalent to saying that the
organism doesn't use five of the six possible codons for arginine, so only
one answer is possible).  However, the general problem of finding a square
root is still one-many, as you can see if you rephrase the problem as 'the
vector (a, 0) has length 4; what is the value of a?'.  This is directly
analogous to the problem 'the amino acid arginine was encoded by a codon;
what codon was it?'.

> This very much depends on how you want to use it. TBLASTN is not very
> good for very short sequences and can not handle protein domains/motifs
> such as those in Prosite.

That's a fair point, and I wouldn't (and didn't ;) ) recommend TBLASTN as a
solution to all such problems.  I get acceptable results for exact matches
down to about 7aa on default settings, though.  Short query sequences can be
a problem whatever method you use.

>> TBLASTN queries against
>> nucleotide databases.  Wait, that's not quite right -
> No, it is not even correct! :-)

Yes, it is correct.  From:
http://www.ncbi.nlm.nih.gov/blast/blast_program.shtml (and other
references...)

"""
tblastn
compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
""" 

They wrote it, so they should know.  Not that I've checked the code ;)
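
As a rough illustration of what "translated in all reading frames" means, the
sketch below translates a nucleotide sequence in six frames and looks for an
exact peptide match.  This is not how TBLASTN is implemented (it scores
alignments rather than matching substrings); the function names, the frame
labels, the hard-coded standard genetic code and the toy sequence are all
assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written in TCAG order.
# Assumption: hard-coded for self-containment, not read from Biopython.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; stops become '*', unknown codons 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TO_AA.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


def find_peptide(peptide, dna):
    """Return (frame, protein_offset) for each exact match of the peptide."""
    hits = []
    for frame, protein in six_frame_translations(dna).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((frame, start))
            start = protein.find(peptide, start + 1)
    return hits


toy = "ATGCGTTGGTAA"  # made-up sequence: ATG CGT TGG TAA
print(six_frame_translations(toy)["+1"])  # MRW*
print(find_peptide("RW", toy))            # [('+1', 1)]
```

Because the nucleotide sequence is translated rather than the protein being
back-translated, each codon maps to exactly one residue (or '*'), so the
one-many problem never arises in the search.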

>> TBLASTN translates
>> nucleotide databases into protein databases and queries against them with
>> the protein sequence, partly because of the one-many mapping of
>> back-translation.
> Not exactly as stop codons are not in protein databases except where
> they code for an amino acid.

Stop codons are not (usually) in protein databases, that's true.  But they
*are* in nucleotide databases, which is what TBLASTN queries.  For example,
these are TBLASTN search results, in opposite directions on the same
nucleotide sequence, that span stop codons in the subject sequence,
indicated by '*' in the BLAST output (even though there are different stop
codons; Artemis handles this more elegantly):

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome
          Length = 5064019

 Score = 79.0 bits (193), Expect = 8e-17
 Identities = 38/40 (95%), Positives = 38/40 (95%), Gaps = 2/40 (5%)
 Frame = +2

Query: 1  YPHSTAEYLILFE-INPRS-PFFCWIFWNLMLRDVDLENF 38
          YPHSTAEYLILFE INPRS PFFCWIFWNLMLRDVDLENF
Sbjct: 2  YPHSTAEYLILFE*INPRS*PFFCWIFWNLMLRDVDLENF 121

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome
          Length = 5064019

 Score = 56.6 bits (135), Expect = 4e-10
 Identities = 29/32 (90%), Positives = 29/32 (90%), Gaps = 3/32 (9%)
 Frame = -3

Query: 1       CNGRWRC-SPL-CYISPRISCRSW-LKPSAIV 29
               CNGRWRC SPL CYISPRISCRSW LKPSAIV
Sbjct: 2851610 CNGRWRC*SPL*CYISPRISCRSW*LKPSAIV 2851515

>> That's not true; fastacmd can extract FASTA-formatted sequences from any
>> (version number compatibilities notwithstanding) correctly-formatted BLAST
>> database.
>>   
> Obviously because you still have direct access to the DNA sequence.

I'd call it indirect access if you've, say, downloaded a precompiled nt
database from NCBI and then have to extract the FASTA sequence from that
compiled database.  Either way, if you're querying a nucleotide database,
you've got to have a representation of the nucleotide sequence *somewhere*.

>> Even if both of the above options fail, and you can acquire the new sequence
>> by some accession identifier, you can build a new local database from that
>> sequence alone, and find where the match is.  Or translate and search
>> directly in Python.
>>   
> These were some of the things that one was trying to avoid, especially
> repeating it all over again and hoping like crazy that it is still
> present. 

Some things are just harder work than others ;)

> (Genome assemblies are not very forgiving.)

The genomes I've worked on have had stable sequences at revision points for
both assembly and annotation (though the old revision points have not been
kept publicly in all cases, which can be awkward).  All should, IMO.  But
that's a different thread on a different mailing list...

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
