[Biopython] New NCBI codon tables with 'ambiguous stop codons'

Peter Cock p.j.a.cock at googlemail.com
Wed Jan 10 10:54:52 UTC 2018


I'd discussed these ideas with Markus on GitHub and am in broad agreement.

See also Markus' pull request:

https://github.com/biopython/biopython/pull/1501

I agree it might be nice to only give a warning if the problematic
codons are present, but that comes with performance issues - so
Markus' choice to give a warning if one of these tables is used seems
very pragmatic. We could review this in a future release (narrow the
scope of the warning).

Peter

On Wed, Jan 10, 2018 at 8:10 AM, Markus Piotrowski
<Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> As already explained by Peter in this GitHub issue
> (https://github.com/biopython/biopython/issues/1224), new codon tables are
> available that pose a problem for Biopython's translate method (and maybe
> for other places).
>
> These tables (tables 27, 28, 31) do not have explicit stop codons; a stop
> codon can, depending on the context, either encode a STOP or an amino acid.
> I will call them 'dual-coding stop codons' to distinguish them from
> ambiguous codons, where the codon sequence itself is ambiguous (like 'TAR').
>
> I have made a pull request which will handle these codon tables like this:
> For Bio/Data/CodonTable:
> These codons will be added both to the forward_table dict and to the
> stop_codons list (usually, a stop codon is not present in forward_table).
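
A minimal sketch of what "present in both places" means, using plain
Python containers rather than Biopython's CodonTable class. The helper
name dual_coding_codons and the tiny five-codon table are hypothetical,
chosen only for illustration; the TGA entry follows NCBI table 27, where
TGA is Trp (W) or stop.

```python
# Toy stand-in for a codon table; NOT Biopython's CodonTable class.
# Only a handful of codons are listed, enough for the example sequence.
forward_table = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
stop_codons = ["TGA"]


def dual_coding_codons(forward_table, stop_codons):
    """Return the stop codons that also have an amino-acid entry."""
    return [codon for codon in stop_codons if codon in forward_table]


print(dual_coding_codons(forward_table, stop_codons))
```

In a conventional table the returned list is empty, because stop codons
are kept out of the forward mapping.
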
>
> For translations in Bio/Seq.py:
> - If these tables are used (or any other CodonTable object where a stop
> codon appears within the forward_table), a BiopythonWarning that explains
> the problem is always issued.
> - If 'to_stop=True', a ValueError is raised (because we don't know whether
> such a 'dual-coding' codon actually encodes a stop, but 'to_stop'
> explicitly asks us to translate until a stop codon appears).
> - 'Dual-coding' stop codons will always be translated into their encoded
> amino acids, except if 'cds=True'.
> - If 'cds=True', the final codon of the sequence will be evaluated as a stop
> codon. That's fine, since 'cds=True' tells us that we expect (only one) stop
> codon as the last codon in the sequence.
>
> This behavior would give the following results:
> translate("ATGGCACGGAAGTGA") --> 'MARK*' (usual case)
> translate("ATGGCACGGAAGTGA", table=27) --> BiopythonWarning + 'MARKW' (the
> dual-coding 'TGA' in table 27 is translated as an amino acid, here W)
>
> translate("ATGGCACGGAAGTGA", table=27, cds=True) --> BiopythonWarning +
> 'MARK' (the final codon is found to be a stop codon and will not be
> translated, as usual with cds=True)
>
> translate("ATGGCACGGAAGTGA", table=27, to_stop=True) --> ValueError
> (to_stop=True is not allowed, since we don't know whether a dual-coding stop
> codon will encode STOP)
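
The rules above can be sketched in plain Python. This is not the code
from the pull request: toy_translate, STANDARD and TABLE_27_LIKE are
hypothetical names, the tables hold only the codons needed here, and the
cds handling is reduced to the final-stop check (no start-codon or
internal-stop validation). A plain UserWarning stands in for
BiopythonWarning.

```python
import warnings

# Toy tables with only the codons used in the example sequence.
STANDARD = {
    "forward": {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K"},
    "stops": ["TAA", "TAG", "TGA"],
}
TABLE_27_LIKE = {
    "forward": {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"},
    "stops": ["TGA"],  # TGA is also in "forward", so it is dual-coding
}


def toy_translate(seq, table, to_stop=False, cds=False):
    """Translate seq, treating dual-coding stop codons as described above."""
    forward, stops = table["forward"], table["stops"]
    dual = [c for c in stops if c in forward]
    if dual:
        if to_stop:
            raise ValueError(
                "to_stop=True is ambiguous: %s can be STOP or an amino acid"
                % dual[0]
            )
        # Warn whenever such a table is used, whatever the sequence contains.
        warnings.warn("Table has dual-coding stop codons: " + ", ".join(dual))
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if cds:
        # With cds=True the last codon must be a stop and is not translated.
        if not codons or codons[-1] not in stops:
            raise ValueError("cds=True expects a final stop codon")
        codons = codons[:-1]
    protein = []
    for codon in codons:
        if codon in forward:  # dual-coding codons take this branch
            protein.append(forward[codon])
        elif codon in stops:
            if to_stop:
                break
            protein.append("*")
        else:
            raise KeyError("codon %r is not in this toy table" % codon)
    return "".join(protein)
```

Because the forward lookup is tried first, a dual-coding codon is read as
an amino acid everywhere except in the final position under cds=True.
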
>
> It would be nice to have some feedback on this. Alternative options would
> be:
> - don't raise a warning if 'cds=True'. However, people may not be
> aware that there is a potential problem with the respective codon table.
> - only act (raise warning/exception) if such dual-coding codons are actually
> used in the sequence (instead of acting immediately if such a codon table is
> used). This would require changing our translation logic in Seq.py's
> translate methods.
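
For the second alternative, one possible shape of the narrower check is
an in-frame scan before translating. This is only a sketch under the
same toy-table assumption; dual_coding_codons_used is a hypothetical
helper, not part of Biopython, and the extra pass over the sequence is
the kind of overhead that makes warning per table the simpler choice.

```python
# Toy table entries; TGA appears in both containers (dual-coding).
forward_table = {"ATG": "M", "GCA": "A", "CGG": "R", "AAG": "K", "TGA": "W"}
stop_codons = ["TGA"]


def dual_coding_codons_used(seq, forward_table, stop_codons):
    """Return the dual-coding codons that occur in frame in seq."""
    dual = {c for c in stop_codons if c in forward_table}
    if not dual:
        return []  # conventional table: nothing to scan for
    used = []
    # Step through complete codons only, starting at position 0.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in dual and codon not in used:
            used.append(codon)
    return used


print(dual_coding_codons_used("ATGGCACGGAAGTGA", forward_table, stop_codons))
```

A caller could then warn or raise only when the returned list is
non-empty.
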
>
> Regards,
> Markus