[Biopython] Question on seq.translate()

Peter Cock p.j.a.cock at googlemail.com
Mon Jun 5 09:09:36 UTC 2017


Hi Sebastian,

Good question.

Reading this puzzled me for a while, but I saw the problem
once I tried to reproduce it - you've told Biopython you have
unambiguous DNA, which means an unambiguous genetic
code table gets used, but your sequence has an N in it.

So the simple "fix" is to use ambiguous_dna instead:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import ambiguous_dna
>>> Seq('CCGGGTT', ambiguous_dna).translate()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/Seq.py:2101:
BiopythonWarning: Partial codon, len(sequence) not a multiple of
three. Explicitly trim the sequence or add trailing N before
translation. This may become an error in future.
  BiopythonWarning)
Seq('PG', ExtendedIUPACProtein())
>>> Seq('CCGGGTTNN', ambiguous_dna).translate()
Seq('PGX', ExtendedIUPACProtein())

I wonder if the translation code should be modified to always
use the larger genetic code tables with the ambiguous bases,
regardless of the specified alphabet?
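
To make that idea concrete, here is a minimal pure-Python sketch of
translating with ambiguous bases. It is not Biopython's implementation:
the names IUPAC_DNA, STANDARD_TABLE, translate_codon and translate are
made up for illustration, only the standard genetic code is covered,
and any codon that could encode more than one residue (or a stop) is
simply reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC ambiguity codes mapped to the bases they can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one codon, returning 'X' if the result is not unique."""
    try:
        choices = [IUPAC_DNA[base] for base in codon.upper()]
    except KeyError:
        # Letters such as '?' or '-' are not IUPAC DNA codes (assumed invalid).
        raise ValueError(f"Codon {codon!r} is invalid")
    # Expand the ambiguous codon into every concrete codon it could be.
    possible = {STANDARD_TABLE["".join(c)] for c in product(*choices)}
    return possible.pop() if len(possible) == 1 else "X"


def translate(dna):
    """Translate complete codons only, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(translate_codon(dna[i:i + 3]) for i in range(0, usable, 3))


print(translate("CCGGGTTNN"))
```

With this approach 'TNN' and 'TAN' come out as "X", while a codon
like 'CCN' still gives "P" because all four expansions agree.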

There is an open question on whether we should enforce the
alphabet letters (if used), which would have caught your
inconsistency:

https://github.com/biopython/biopython/issues/1040
https://redmine.open-bio.org/issues/2597

My feeling is we should, but it would slow down a lot of
scripts - even if using generic alphabets which do not
specify the allowed letters.
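
As a rough illustration of what such enforcement involves, here is a
small sketch of a letter check. The function name check_letters and the
constant UNAMBIGUOUS_DNA_LETTERS are hypothetical and not part of
Biopython; the point is only that every sequence would need a full scan
of its letters when it is created, which is where the slowdown would
come from.

```python
# Hypothetical stand-in for the letters an unambiguous DNA alphabet allows.
UNAMBIGUOUS_DNA_LETTERS = "GATC"


def check_letters(sequence, allowed_letters):
    """Raise ValueError if the sequence uses letters outside the alphabet."""
    # Building the set scans the whole sequence once, so the cost grows
    # linearly with sequence length.
    unexpected = set(sequence.upper()) - set(allowed_letters.upper())
    if unexpected:
        raise ValueError(
            f"Letters not in alphabet: {''.join(sorted(unexpected))}"
        )


check_letters("CCGGGTT", UNAMBIGUOUS_DNA_LETTERS)  # passes silently
try:
    check_letters("CCGGGTTNN", UNAMBIGUOUS_DNA_LETTERS)
except ValueError as err:
    print(err)
```

A check like this would have flagged the N in 'CCGGGTTNN' as soon as
the sequence was declared to be unambiguous DNA.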

Peter

On Sat, Jun 3, 2017 at 11:28 PM, Sebastian Bassi <sbassi at gmail.com> wrote:
>
>>>> from Bio.Seq import Seq
>>>> import Bio.Alphabet
>>>> seq = Seq('CCGGGTT', Bio.Alphabet.IUPAC.unambiguous_dna)
>>>> seq.translate()
> /home/sbassi/projects/venvs/biopy169/lib/python3.5/site-packages/Bio/Seq.py:2095:
> BiopythonWarning: Partial codon, len(sequence) not a multiple of three.
> Explicitly trim the sequence or add trailing N before translation. This may
> become an error in future.
>   BiopythonWarning)
> Seq('PG', IUPACProtein())
>
> So I added two Ns to make it a multiple of three. But I got this and I don't
> know if this is the intended behavior or not:
>
>>>> seq = Seq('CCGGGTTNN', Bio.Alphabet.IUPAC.unambiguous_dna)
>>>> seq.translate()
> Traceback (most recent call last):
>   File
> "/home/sbassi/projects/venvs/biopy169/lib/python3.5/site-packages/Bio/Seq.py",
> line 2107, in _translate_str
>     amino_acids.append(forward_table[codon])
> KeyError: 'TNN'
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/sbassi/projects/venvs/biopy169/lib/python3.5/site-packages/Bio/Seq.py",
> line 1038, in translate
>     cds, gap=gap)
>   File
> "/home/sbassi/projects/venvs/biopy169/lib/python3.5/site-packages/Bio/Seq.py",
> line 2124, in _translate_str
>     "Codon '{0}' is invalid".format(codon))
> Bio.Data.CodonTable.TranslationError: Codon 'TNN' is invalid
>
>
> I was expecting to have X as an unknown amino-acid, according to this note
> in the docstring:
>
> NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
> or a stop codon. These are translated as "X". Any invalid codon
> (e.g. "TA?" or "T-A") will throw a TranslationError.

