[Biopython-dev] unambiguous DNA

Thu Aug 10 03:43:07 EDT 2000

thomas at cbs.dtu.dk  asked:
>I'd like to get 'X' instead of a '*' (stop signal) when there is no
>clear translation ...(when extracting all possible ORFs from raw - often
>pure - sequence data during e.g. complete genome seqeuncing projects)
>eg.
>the translation of  'NATGATTANAATNTATTCCATTATATTG'
>should result in
>       XDXNXFHYI
>instead of
>       *D*N*FHYI
>
>is this already  possible ?

Yes.  However, the default way (with utils.translate) won't do it.
That's there mostly as a bootstrap to make it really easy to do the
simple things, without needing to understand all I did with alphabet
types.

First off, "NAT" doesn't map to a single residue - it can be a Y,
or H or N or ... In fact, it doesn't map to a stop codon, so the
current biopython code wouldn't try to create a '*' character.
If you tried the following:

from Bio import Seq
from Bio.Alphabet import IUPAC
from Bio.Tools import Translate
seq = Seq.Seq('NATGATTANAATNTATTCCATTATATTG', IUPAC.ambiguous_dna)
protein = Translate.ambiguous_dna_by_id[1].translate(seq, "X")

you would get an exception - Bio.Data.CodonTable.TranslationError: NAT

So you need to define your problem more precisely.  You have:
  1) input sequence uses the ambiguous DNA alphabet
  2) output sequence uses the ambiguous protein alphabet along with
      '*' for stop codon symbol and 'X' for untranslatable protein residues
  3) translation table the standard table

There isn't an alphabet in biopython which supports 2), so we need to
make a new one:

from Bio import Alphabet
from Bio.Data import IUPACData
class ProteinX(Alphabet.ProteinAlphabet):
   letters = IUPACData.extended_protein_letters + "X"

proteinX = ProteinX()

(To interoperate well with the other tools, you would also need to
set up the right encodings for ProteinX molecular weight table and
tell the PropertyManager about it.)

The existing translation tables raise an exception if there isn't
a given codon, so what we can do is write a wrapper class around
an existing ambiguous translation table which returns an "X" when
that happens.  The Codon table only needs to implement the "get"
function for "translate" (it needs __getitem__ for "translate_to_stop").

from Bio.Data import CodonTable
from Bio.Alphabet import IUPAC

# Forward translation table, mapping codon to protein
class MissingTable:
  def __init__(self, table):
    self._table = table
  def get(self, codon, stop_symbol):
    try:
      return self._table.get(codon, stop_symbol)
    except CodonTable.TranslationError:
      return 'X'

# Make the codon table given an existing table
def makeTableX(table):
  assert table.protein_alphabet == IUPAC.extended_protein
  return CodonTable.CodonTable(table.nucleotide_alphabet, proteinX,
                               MissingTable(table.forward_table),
                               table.back_table, table.start_codons,
                               table.stop_codons)

standard_table = makeTableX(CodonTable.ambiguous_dna_by_id[1])
# Could also convert the other translation tables ...

# Now make a translator object based on that codon table
from Bio.Tools import Translate
translator = Translate.Translator(standard_table)

from Bio import Seq
>>> translator.translate(Seq.Seq("NATGATTANAATNTATTCCATTATATTG", \
...                              IUPAC.ambiguous_dna))
Seq('XDXNXFHYI', ProteinX())
>>> translator.translate(Seq.Seq("NATGATTANAATNTATTCCATTATATTGTTTAR", \
...                              IUPAC.ambiguous_dna))
Seq('XDXNXFHYIV*', ProteinX())

     =       =       =       =       =        =      =     =
         (This is the section you probably want to use.)

As you can see, this still translates stop codons to "*" if there
isn't any other option ("TAR" is either "TAA" or "TAG", which are
both stop codons).  Suppose you didn't want that.  Then you might
try the following:

class MissingTable2:
  def __init__(self, table):
    self._table = table
  def __getitem__(self, codon):  # translate_to_codon uses [codon]
    try:                         #   instead of get(codon, stop_symbol)
      return self._table.get(codon, 'X')  # Always uses "X"
    except CodonTable.TranslationError:
      return 'X'                 # for what would be stop codons

# Make the codon table given an existing table
def makeTableX2(table):
  assert table.protein_alphabet == IUPAC.extended_protein
  return CodonTable.CodonTable(table.nucleotide_alphabet, proteinX,
                               MissingTable2(table.forward_table),
                               table.back_table, table.start_codons,
                               table.stop_codons)

standard_table2 = makeTableX2(CodonTable.ambiguous_dna_by_id[1])
translator2 = Translate.Translator(standard_table2)

>>> translator2.translate_to_stop(Seq.Seq("NATGATTANAATNTATTCCATTATATTG",
...                                       IUPAC.ambiguous_dna))
Seq('XDXNXFHYI', ProteinX())
>>>
translator2.translate_to_stop(Seq.Seq("NATGATTANAATNTATTCCATTATATTGTTTAR",
...                                       IUPAC.ambiguous_dna))
Seq('XDXNXFHYIVX', ProteinX())

    =        =         =          =           =
This section is here for edification only.  It does not produce
what you want.

Another way you might try to do it is to define 'X' as an
ambiguous character for any standard protein symbol.  The
AmbiguouCodonTable takes a regular Codon table and two tables
of values for the nucleotide and protein ambiguity codes, and
produces a CodonTable which tries to find the minimal ambiguity
needed to cover the input ambiguity.  Since 'X' codes for any
protein, it becomes the fail-safe.  That is:

proteinX_values = IUPACData.extended_protein_values.copy()
proteinX_values['X'] = IUPACData.extended_protein_letters

def makeTableX3(table):
  assert table.protein_alphabet == IUPAC.protein
  return CodonTable.AmbiguousCodonTable(table, IUPAC.ambiguous_dna,
                                        IUPACData.ambiguous_dna_values,
                                        proteinX, proteinX_values)

std_table = makeTableX3(CodonTable.unambiguous_dna_by_id[1])
translator3 = Translate.Translator(std_table)

This works for some cases, like NATGATTAGAATNTATTCCATTATATTGTTTAR,
which generates 'XD*NXFHYIV*', but it does not work for "TAN" becase
that codes for both stop codons and for amino acids.  The
AmbiguousCodonTable doesn't allow that to occur since 'X' is only
a protein symbol and '*' is only a stop symbol - there isn't
a symbol which covers both cases.

I chose this because I wanted strict typing in the alphabets, and
I knew you could get around it with a bit more work, as shown above.

 =  =  =  =  =  =

This may seem complicated.  I would love to see a cleaner way to do it,
but I couldn't think of one which let me:
  use (ambiguous and unambigous) (dna/rna and protein)
  distinguish when an ambiguous codon can be both a stop codon and
      code for an amino acid

So what I did was provide the simplest case (standard alphabets and
common translation tables) as well as tools to make it do exactly
what you want it to do.

                    Andrew

P.S.
  There is a bug in the existing Alphabet.AlphabetEncoder.__getattr__.
It shouldn't forward anything starting and ending with "__".  The
"ProteinX()" shown after the call to translate() really should be
a 'HasStopCodon(ProteinX(), "*")'.  The object is correct, it's just
using the wrong __repr__ for printing.