[Biopython] New NCBI codon tables with 'ambiguous stop codons'
Markus Piotrowski
Markus.Piotrowski at ruhr-uni-bochum.de
Wed Jan 10 08:10:19 UTC 2018
As Peter already explained in this GitHub issue
(https://github.com/biopython/biopython/issues/1224), there are new codon
tables available that pose a problem for Biopython's translate method
(and maybe for other places).
These tables (tables 27, 28, 31) do not have explicit stop codons; a
stop codon can, depending on the context, either encode a STOP or an
amino acid. I will call them 'dual-coding stop codons' to distinguish
them from ambiguous codons, where the codon sequence itself is ambiguous
(like 'TAR').
I have made a pull request which will handle these codon tables like this:
For Bio/Data/CodonTable:
These codons will be added both to the forward_table dict and to the
stop_codons list (usually, a stop codon is not present in the
forward_table).
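
As a rough illustration of that data layout, here is a minimal stand-alone sketch. It does not import Biopython; the two variables only mimic the forward_table and stop_codons attributes on a toy excerpt, the helper name dual_coding_stops is made up for this example, and TGA is assumed to encode W (Trp) when it is not read as a stop.

```python
# Toy excerpt of a codon table; only a handful of codons are listed.
# Assumption: TGA encodes W (Trp) when it is not read as a stop.
forward_table = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
stop_codons = ["TGA"]


def dual_coding_stops(forward_table, stop_codons):
    """Return the stop codons that also have an amino acid in forward_table."""
    return sorted(codon for codon in stop_codons if codon in forward_table)


dual = dual_coding_stops(forward_table, stop_codons)
print(dual)
```

A table without such codons gives an empty list, so the same check could tell the two kinds of table apart.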
For translations in Bio/Seq.py:
- If these tables are used (or any other CodonTable object in which a
stop codon also appears in the forward_table), a BiopythonWarning that
explains the problem is always issued.
- If 'to_stop=True', a ValueError is raised (because we don't know what
the actual encoding of such a 'dual-coding' codon would be, but 'to_stop'
explicitly asks us to translate until a stop codon appears).
- 'Dual-coding' stop codons will always be translated into their encoded
amino acids, except if 'cds=True'.
- If 'cds=True', the final codon of the sequence will be evaluated as a
stop codon. That's fine, since 'cds=True' tells us that we expect (only
one) stop codon as the last codon in the sequence.
This behavior would give the following results:
translate ("ATGGCACGGAAGTGA") --> 'MARK*' (usual case)
translate ("ATGGCACGGAAGTGA", table=27) --> BiopythonWarning + 'MARKW'
(the dual-coding 'TGA' in table 27 is translated as amino acid)
translate ("ATGGCACGGAAGTGA", table=27, cds=True) --> BiopythonWarning +
'MARK' (the final codon is found to be a stop codon and will not be
translated, as usual with cds=True)
translate ("ATGGCACGGAAGTGA", table=27, to_stop=True) --> ValueError
(to_stop=True is not allowed, we don't know if a dual-coding stop codon
will encode STOP)
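
The rules and the four calls above can be mimicked with a small stand-alone sketch. This is not the code from the pull request: translate_sketch and the two toy tables are hypothetical, UserWarning stands in for BiopythonWarning, the start-codon check that 'cds=True' normally performs is left out, and the dual-coding toy table assumes that TGA encodes W (Trp), as in NCBI's listing of table 27.

```python
import warnings

# Toy excerpts of two codon tables (hypothetical, only the codons used below).
STANDARD = {
    "forward_table": {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K"},
    "stop_codons": ["TAA", "TAG", "TGA"],
}
# Assumption: here TGA is both a stop codon and the amino acid W (Trp).
DUAL = {
    "forward_table": {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"},
    "stop_codons": ["TGA"],
}


def translate_sketch(seq, table, to_stop=False, cds=False):
    """Translate seq codon by codon, following the proposed rules."""
    forward, stops = table["forward_table"], table["stop_codons"]

    # Act as soon as the table itself contains a dual-coding stop codon.
    if any(codon in forward for codon in stops):
        if to_stop:
            raise ValueError("to_stop=True is ambiguous with dual-coding stop codons")
        warnings.warn("table contains dual-coding stop codons", UserWarning)

    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if cds:
        # With cds=True the final codon must be a stop and is not translated.
        if not codons or codons[-1] not in stops:
            raise ValueError("final codon is not a stop codon")
        codons = codons[:-1]

    protein = []
    for codon in codons:
        if codon in forward:
            # Dual-coding codons end up here and become amino acids.
            protein.append(forward[codon])
        elif codon in stops:
            if cds:
                raise ValueError("extra in-frame stop codon")
            if to_stop:
                break
            protein.append("*")
        else:
            raise ValueError(f"cannot translate codon {codon!r}")
    return "".join(protein)


seq = "ATGGCACGGAAGTGA"
print(translate_sketch(seq, STANDARD))  # MARK*
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(translate_sketch(seq, DUAL))            # MARKW
    print(translate_sketch(seq, DUAL, cds=True))  # MARK
print(len(caught))  # 2, one warning per call with the dual-coding table
try:
    translate_sketch(seq, DUAL, to_stop=True)
except ValueError as err:
    print("ValueError:", err)
```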
It would be nice to have some feedback on this. Alternative options
would be:
- don't issue a warning if 'cds=True'. However, people may not be
aware that there is a potential problem with the respective codon table.
- only act (issue a warning/raise an exception) if such dual-coding codons
are actually used in the sequence (instead of acting immediately if such a
codon table is used). This would require changing our translation logic
in Seq.py's translate methods.
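
To make the second alternative concrete, here is a minimal sketch of an in-frame check that could run before deciding whether to warn. The helper name dual_coding_codons_used and the toy table are hypothetical, and this is not how Seq.py is currently organised; it only shows what 'actually used in the sequence' could mean.

```python
def dual_coding_codons_used(seq, forward_table, stop_codons):
    """Return the dual-coding stop codons that occur in frame in seq."""
    dual = {codon for codon in stop_codons if codon in forward_table}
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    in_frame = {seq[i:i + 3] for i in range(0, usable, 3)}
    return sorted(in_frame & dual)


# Toy table excerpt; assumption: TGA is both a stop codon and W (Trp).
forward_table = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
stop_codons = ["TGA"]

print(dual_coding_codons_used("ATGGCACGGAAGTGA", forward_table, stop_codons))
print(dual_coding_codons_used("ATGGCACGGAAG", forward_table, stop_codons))
```

Only the first call finds a dual-coding codon, so only that translation would need a warning under this option.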
Regards,
Markus