[Biopython-dev] [Bug 2381] translate and transcribe methods for the Seq object (in Bio.Seq)

Tue Nov 4 18:28:19 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2381

------- Comment #35 from biopython-bugzilla at maubp.freeserve.co.uk  2008-11-04 13:28 EST -------
(In reply to comment #34)
> As I think about this and the various comments, I do think that you must
> apply the same reasoning to non-standard translation as was applied to the
> ORF finding comments. From that I understand that you want a basic
> translation function so function arguments like to_stop or cds_start would be
> inappropriate.

There is certainly an argument that the Bio.Seq translate function/methods
should be kept as simple as possible while providing widely useful
functionality.  Perhaps given the lack of immediate agreement we are at that
point already?  Or perhaps this is a reflection of the different types of
organisms people work with and thus the relative frequencies of non-standard
start codons.

> Also, even if it was possible, I do not see that validating all known start
> codons under all genetic codes fits here.

We have the valid start codons in the CodonTable objects derived from the NCBI,
so it is possible to check them.
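
As a rough sketch of what such a check could look like (this is not the
CodonTable API itself): the set below is a hand-copied stand-in, assumed to
match the start codons listed for NCBI translation table 11, and the names
TABLE_11_START_CODONS and is_valid_start are hypothetical.

```python
# Hypothetical stand-in for the start codons held by a CodonTable object.
# Assumption: copied by hand from NCBI translation table 11 (bacterial).
TABLE_11_START_CODONS = frozenset(
    ["TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"]
)


def is_valid_start(sequence, start_codons=TABLE_11_START_CODONS):
    """Return True if the first three letters are a listed start codon."""
    # Normalise case and accept RNA input by mapping U to T.
    first_codon = str(sequence)[:3].upper().replace("U", "T")
    return first_codon in start_codons


print(is_valid_start("GTGAAAAAGATG"))  # GTG is listed for table 11
print(is_valid_start("GTCAAAAAGATG"))  # GTC is not
```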

> ... Address any non-standard codons to the translated sequence. If you are
> going to allow non-standard start codons, you also need to handle
> selenocysteine (http://en.wikipedia.org/wiki/Selenocysteine) and less so
> pyrrolysine (http://en.wikipedia.org/wiki/Pyrrolysine). 

Why?  Non-standard start codons are pretty common in prokaryotes and the rules
for translating them are simple (once the start codon is identified).

On the other hand selenocysteine and pyrrolysine are very rare, and we can't
define a computer rule to deal with them - so we don't even try.

> The non-standard codon usages are rare and I do really question if these are
> really part of the Seq object translate function or belong elsewhere. I really
> feel that if the user already knows that it is a non-AUG start codon then they
> can replace the first amino acid with Met rather than rely on the translate
> function. For example, the CDS field in the Genbank record for Mouse
> Neuropeptide W (NM_001099664) has:
> /exception="alternative start codon"
> /note="non-AUG (CUG) translation initiation codon".
> So if the user looked at the record then they would know it would need to be
> changed.

Non-standard start codons are not that rare in prokaryotes (and I would not
expect them to be annotated like your mouse example).  When translating a well
annotated sequence, the location itself should be enough.

[I'm assuming we're not talking about the other meaning of the phrase
"alternative start codons" - where a gene may have multiple valid start codons
giving proteins of different lengths but the same C-terminal region.]

> If some form of the non-standard codons is included I would think some
> variant of Leighton's assert idea should be preferred such as using an
> assert_nonstandard argument (or just nonstandard). This would be a string, 
> list or tuple to denote the changes to be made such as say 'Met1' or 'M1'
> where three or single letter code of the desired amino acid and the number
> is the location within the amino acid sequence to be changed. So Met1 would
> mean changing the amino acid at position one with Methionine (M). But I
> recognize this is not sufficient to handle other non-standard cases with
> stop codons.

I thought Leighton was just proposing another name for a boolean argument which
I had called "init" in attachment 1032.

I'm afraid I don't understand your idea of a complicated list argument.

=============================================================================

Here is a concrete example: there are 481 annotated genes in E. coli K12 with
non-standard start codons, which you might want to translate into proteins.

# Using ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12_substr__MG1655/NC_000913.ffn
>>> from Bio import SeqIO
>>> odd = [record for record in SeqIO.parse(open("NC_000913.ffn"), "fasta")
...        if str(record.seq[:3]) != "ATG"]
>>> print "There are %i genes not starting ATG" % len(odd)
There are 481 genes not starting ATG
>>> record = odd[0]
>>> print record.format("fasta")
>ref|NC_000913.2|:5234-5530
GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA
GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT
AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT
TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT
AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA

This starts with GTG, which is a valid bacterial start codon.  I'd like to translate
this and get the actual biologically relevant protein as given in the GenBank
file NC_000913.gbk (maybe with or without the stop symbol at the end).  See:

     CDS             5234..5530
                     /gene="yaaX"
                     /locus_tag="b0005"
                     /codon_start=1
                     /transl_table=11
                     /product="predicted protein"
                     /protein_id="NP_414546.1"
                     /db_xref="ASAP:ABE-0000015"
                     /db_xref="UniProtKB/Swiss-Prot:P75616"
                     /db_xref="GI:16127999"
                     /db_xref="ECOCYC:G6081"
                     /db_xref="EcoGene:EG14384"
                     /db_xref="GeneID:944747"
                     /translation="MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGY
                     YWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR"

Without any non-standard start codon support, my translations start with a V:

>>> print record.seq.translate(table=11)
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
>>> print record.seq.translate(table=11, to_stop=True)
VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR

With this proposed functionality I can obtain the desired results (both with
and without the terminator stop symbol):

>>> print record.seq.translate(table=11, to_stop=True, init=True)
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR
>>> print record.seq.translate(table=11, init=True)
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*
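
To make the proposed behaviour concrete, here is a minimal pure-Python sketch.
It is not the patch in attachment 1032 and does not use Bio.Seq: the function
name translate_sketch is hypothetical, the amino acid string and start codon
set are assumed to match NCBI table 11, and raising ValueError for an
unlisted start codon is my own choice rather than agreed behaviour.  The
example input is the first six codons of the yaaX sequence above with a stop
codon appended.

```python
from itertools import product

BASES = "TCAG"
# Assumption: amino acids for NCBI table 11 in the usual TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FORWARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Assumption: start codons listed for NCBI table 11.
START_CODONS = frozenset(["TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"])


def translate_sketch(dna, to_stop=False, init=False):
    """Translate unambiguous DNA; with init=True the first codon becomes M."""
    dna = str(dna).upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = [FORWARD_TABLE[codon] for codon in codons]
    if init:
        # Only a listed start codon may be read as methionine.
        if not codons or codons[0] not in START_CODONS:
            raise ValueError("first codon is not a valid start codon")
        residues[0] = "M"
    protein = "".join(residues)
    if to_stop:
        # Truncate at the first stop symbol, dropping the symbol itself.
        protein = protein.split("*", 1)[0]
    return protein


fragment = "GTGAAAAAGATGCAATCT" + "TAA"
print(translate_sketch(fragment))                           # VKKMQS*
print(translate_sketch(fragment, init=True))                # MKKMQS*
print(translate_sketch(fragment, to_stop=True, init=True))  # MKKMQS
```

Keeping init as a separate boolean means the plain call still gives a literal
codon-by-codon translation, which is what you want for partial sequences.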

I think that wanting to translate a CDS like this is a fairly common operation.
Perhaps not as common as translation of a partial sequence, or translating
whole genomes or contigs where we want to translate through the stop codons,
but nevertheless a common need.
