[Biopython] New NCBI codon tables with 'ambiguous stop codons'

Markus Piotrowski Markus.Piotrowski at ruhr-uni-bochum.de
Wed Jan 10 08:10:19 UTC 2018

As Peter already explained in this GitHub issue 
(https://github.com/biopython/biopython/issues/1224), new codon tables 
are available that pose a problem for Biopython's translate method 
(and possibly for other places).

These tables (tables 27, 28, 31) do not have explicit stop codons; a 
stop codon can, depending on the context, either encode a STOP or an 
amino acid. I will call them 'dual-coding stop codons' to distinguish 
them from ambiguous codons, where the codon sequence itself is ambiguous 
(like 'TAR').

I have made a pull request which will handle these codon tables like this:
For Bio/Data/CodonTable:
These codons will be added both to the forward_table dict and the 
stop_codons list (usually, a stop codon is not present in the 
forward_table dict).
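
To illustrate the layout described here: the dual-coding codons are exactly those found in both containers. This is only a minimal sketch using plain Python objects in place of a real CodonTable. The names forward_table and stop_codons follow the text; the toy excerpt of a table like table 27 and the helper dual_coding_codons are assumptions made for illustration.

```python
# Toy excerpt resembling NCBI table 27 (an assumption, not a full codon table).
forward_table = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
stop_codons = ["TGA"]


def dual_coding_codons(forward_table, stop_codons):
    """Return the codons listed both as amino-acid codons and as stop codons."""
    return sorted(set(forward_table) & set(stop_codons))


print(dual_coding_codons(forward_table, stop_codons))
```

For a conventional table the two containers are disjoint, so the helper would return an empty list.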

For translations in Bio/Seq.py:
- If these tables are used (or any other CodonTable object in which a 
stop codon also appears in the forward_table), a BiopythonWarning 
explaining the problem is always issued.
- If 'to_stop=True', a ValueError is raised (because we don't know what 
the actual encoding of such a 'dual-coding' codon would be, but 'to_stop' 
explicitly asks us to translate until a stop codon appears).
- 'Dual-coding' stop codons will always be translated into their encoded 
amino acids, except if 'cds=True'.
- If 'cds=True', the final codon of the sequence will be evaluated as a 
stop codon. That's fine, since 'cds=True' tells us that we expect (only 
one) stop codon, as the last codon in the sequence.
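
The rules in this list could be sketched roughly as follows. This is not the code from the pull request, just a self-contained approximation: translate_sketch, DualCodingWarning (standing in for BiopythonWarning) and the two toy tables are hypothetical, and the other checks that 'cds=True' performs (start codon, sequence length) are left out. Raising the ValueError before issuing the warning is also an assumption.

```python
import warnings


class DualCodingWarning(UserWarning):
    """Hypothetical stand-in for BiopythonWarning."""


# Toy excerpts (forward_table, stop_codons); only the codons needed here.
STANDARD = ({"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K"},
            ["TAA", "TAG", "TGA"])
DUAL = ({"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"},
        ["TGA"])


def translate_sketch(seq, table, to_stop=False, cds=False):
    forward_table, stop_codons = table
    dual = set(forward_table) & set(stop_codons)
    if dual:
        if to_stop:
            # We cannot know whether a dual-coding codon means STOP here.
            raise ValueError("to_stop=True is not allowed with dual-coding stop codons")
        warnings.warn("codon table has dual-coding stop codons: %s"
                      % ", ".join(sorted(dual)), DualCodingWarning)
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if cds:
        # With cds=True the final codon is read as a stop and not translated.
        if codons[-1] not in stop_codons:
            raise ValueError("final codon is not a stop codon")
        codons = codons[:-1]
    protein = []
    for codon in codons:
        if codon in forward_table:
            # Dual-coding codons take this branch: translated as amino acid.
            protein.append(forward_table[codon])
        elif codon in stop_codons:
            if cds:
                raise ValueError("extra in-frame stop codon")
            if to_stop:
                break
            protein.append("*")
        else:
            raise ValueError("codon %r is not in the toy table" % codon)
    return "".join(protein)


print(translate_sketch("ATGGCACGGAAGTGA", STANDARD))
```

Calling the sketch with the DUAL table issues the warning in every case, whether or not the sequence actually contains 'TGA', which mirrors the "always" in the first bullet.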

This behavior would give the following results:
translate ("ATGGCACGGAAGTGA") --> 'MARK*' (usual case)
translate ("ATGGCACGGAAGTGA", table=27) --> BiopythonWarning + 'MARKW' 
(the dual-coding 'TGA' in table 27 is translated as an amino acid, 
tryptophan)

translate ("ATGGCACGGAAGTGA", table=27, cds=True) --> BiopythonWarning + 
'MARK' (the final codon is found to be a stop codon and will not be 
translated, as usual with cds=True)

translate ("ATGGCACGGAAGTGA", table=27, to_stop=True) --> ValueError 
(to_stop=True is not allowed, we don't know if a dual-coding stop codon 
will encode STOP)

It would be nice to have some feedback on this. Alternative options 
would be:
- don't issue a warning if 'cds=True'. However, people might then not be 
aware that there is a potential problem with the respective codon table.
- only act (issue the warning / raise the exception) if such dual-coding 
codons are actually used in the sequence (instead of acting immediately 
if such a codon table is used). This would require changing our 
translation logic in Seq.py's translate methods.
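
For the second alternative, the check would have to look at the codons of the sequence itself. A minimal sketch of such a scan, assuming plain Python containers instead of a CodonTable and a hypothetical helper name dual_coding_positions:

```python
def dual_coding_positions(seq, forward_table, stop_codons):
    """Return (index, codon) pairs for in-frame dual-coding codons in seq."""
    dual = set(forward_table) & set(stop_codons)
    seq = seq.upper()
    return [(i, seq[i:i + 3])
            for i in range(0, len(seq) - 2, 3)
            if seq[i:i + 3] in dual]


# Toy excerpt resembling table 27 (an assumption, not a full codon table).
toy_forward = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
toy_stops = ["TGA"]

print(dual_coding_positions("ATGGCACGGAAGTGA", toy_forward, toy_stops))
```

A warning or exception would then only be needed when the returned list is non-empty.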

