[Biopython] Back translation support in Biopython

Mon Jul 2 11:27:08 UTC 2012

On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> Hi Igor,
>>
>> It sounds like you're referring to aligning amino acid sequences to codon
>> sequences, as PAL2NAL does. This is different from what most people mean by
>> back translation, but as you point out, certainly useful.
>>
>> If you write a function that can match a protein sequence alignment to a set
>> of raw CDS sequences, returning a nucleotide alignment based on the
>> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
>> exactly that, plus a bit more, and is a fairly well-known and easily
>> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
>> under Bio.Align.Applications, using the existing Bio.Applications framework.
>
> As per the old thread, a simple function in Python taking the gapped protein
> sequence, original nucleotide coding sequence, and the translation table
> does sound useful. Then using that, you could go from a protein alignment
> plus the original nucleotide coding sequences to a codon alignment, or
> other tasks. Given this is all relatively straightforward string manipulation
> and we already have the required genetic code tables in Biopython, I'm not
> convinced that wrapping PAL2NAL would be the best solution (for this sub
> task).

Hi Igor,

Did you do any work on back-translation (alignment threading) in Biopython?

We needed to do this locally, and for some reason (yet to be determined)
T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
implementation:

https://github.com/peterjc/biopython/tree/back_trans
https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80

Currently there is just one commit, adding a Bio.Align.alignment_back_translate(...)
function which takes a protein alignment and a dictionary of nucleotide
records (easy to get with Bio.SeqIO and Bio.AlignIO), with a standalone
example included in the doctest. There is also a new (currently private)
function to do this for one sequence pair - perhaps useful on its own?
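
To make the single-pair step concrete, here is a minimal sketch in plain
Python strings. It is not the code from the commit above: the name
thread_codons is hypothetical, and it assumes a single-character gap and a
coding sequence with no trailing stop codon. It also does not check each
codon against a translation table, which a fuller version would want to do.

```python
def thread_codons(gapped_protein, nucleotides, gap="-"):
    """Thread an ungapped coding sequence onto a gapped protein sequence.

    Hypothetical helper for illustration only. Each residue consumes the
    next three nucleotides; each protein gap becomes three gap characters.
    """
    if len(gap) != 1:
        raise ValueError("Gap must be a single character")
    ungapped_length = len(gapped_protein) - gapped_protein.count(gap)
    if len(nucleotides) != 3 * ungapped_length:
        raise ValueError(
            "Expected %i nucleotides for %i residues, got %i"
            % (3 * ungapped_length, ungapped_length, len(nucleotides))
        )
    codons = []
    position = 0  # index into the ungapped nucleotide sequence
    for residue in gapped_protein:
        if residue == gap:
            codons.append(gap * 3)
        else:
            codons.append(nucleotides[position:position + 3])
            position += 3
    return "".join(codons)


print(thread_codons("M-K", "ATGAAA"))
```

With Biopython objects you would pass in str(record.seq) for both the
protein and nucleotide records.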

There are potential complications with ID mapping between the proteins
and nucleotides, thus the option of a key function, and with the gap
characters (would you ever want to use different gap characters in the
protein and nucleotide alignments?). We could discuss implementation
details over on the biopython-dev list, but the general API discussion
(e.g. where to put the function and what to call it) might as well be here.
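
As a rough sketch of how the key function and gap options could fit
together, again in plain Python rather than the real API: the function name
and all argument names below are hypothetical, the alignment is just a list
of (id, gapped sequence) pairs, and the nucleotides are a dict of ungapped
strings.

```python
def back_translate_alignment(protein_alignment, nucleotide_seqs,
                             key_function=None,
                             protein_gap="-", nucleotide_gap="-"):
    """Thread nucleotide sequences onto a gapped protein alignment.

    Hypothetical sketch for illustration only.
    protein_alignment: list of (protein_id, gapped_protein) pairs.
    nucleotide_seqs: dict mapping nucleotide ID to ungapped coding sequence.
    key_function: optional callable mapping a protein ID to a nucleotide ID.
    """
    threaded = []
    for protein_id, gapped_protein in protein_alignment:
        # Map the protein ID to the nucleotide ID if a key function is given
        nuc_id = key_function(protein_id) if key_function else protein_id
        if nuc_id not in nucleotide_seqs:
            raise KeyError("No nucleotide sequence for %r" % nuc_id)
        nucleotides = nucleotide_seqs[nuc_id]
        residues = len(gapped_protein) - gapped_protein.count(protein_gap)
        if len(nucleotides) != 3 * residues:
            raise ValueError("Length mismatch for %r" % protein_id)
        codons = []
        position = 0
        for residue in gapped_protein:
            if residue == protein_gap:
                # The output gap character may differ from the input one
                codons.append(nucleotide_gap * 3)
            else:
                codons.append(nucleotides[position:position + 3])
                position += 3
        threaded.append((protein_id, "".join(codons)))
    return threaded


alignment = [("gene1_prot", "MK-L"), ("gene2_prot", "M-KL")]
cds = {"gene1": "ATGAAACTG", "gene2": "ATGAAGTTA"}
result = back_translate_alignment(
    alignment, cds,
    key_function=lambda name: name.rsplit("_", 1)[0],
    nucleotide_gap=".",
)
for name, seq in result:
    print(name, seq)
```

Here the key function strips a "_prot" suffix (an invented naming scheme)
so the protein IDs match the nucleotide IDs.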

Regards,

Peter