[Biopython] Back translation support in Biopython

Peter Cock p.j.a.cock at googlemail.com
Mon Jul 2 11:27:08 UTC 2012


On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
>> Hi Igor,
>>
>> It sounds like you're referring to aligning amino acid sequences to codon
>> sequences, as PAL2NAL does. This is different from what most people mean by
>> back translation, but as you point out, certainly useful.
>>
>> If you write a function that can match a protein sequence alignment to a set
>> of raw CDS sequences, returning a nucleotide alignment based on the
>> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
>> exactly that, plus a bit more, and is a fairly well-known and easily
>> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
>> under Bio.Align.Applications, using the existing Bio.Applications framework.
>
> As per the old thread, a simple function in Python taking the gapped protein
> sequence, original nucleotide coding sequence, and the translation table
> does sound useful. Then using that, you could go from a protein alignment
> plus the original nucleotide coding sequences to a codon alignment, or
> other tasks. Given this is all relatively straightforward string manipulation
> and we already have the required genetic code tables in Biopython, I'm not
> convinced that wrapping PAL2NAL would be the best solution (for this
> subtask).

Hi Igor,

Did you do any work on back-translation (alignment threading) in Biopython?

We needed to do this locally, and for some reason (yet to be determined)
T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
implementation:

https://github.com/peterjc/biopython/tree/back_trans
https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80

Currently there is just one commit, adding a
Bio.Align.alignment_back_translate(...) function which takes a protein
alignment and a dictionary of nucleotide records (both easy to get with
Bio.SeqIO and Bio.AlignIO), with a standalone example included in the
doctest. There is also a new (currently private) function to do this for a
single sequence pair; perhaps that would be useful on its own?
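
To make the single-pair case concrete, here is a minimal sketch of threading
one ungapped coding sequence onto one gapped protein sequence. It is not the
code from the commit above: the name thread_codons and all of its parameters
are hypothetical, it works on plain strings rather than Seq or SeqRecord
objects, and it assumes the nucleotide sequence has no trailing stop codon.
The optional codon_table is assumed to be a plain codon-to-amino-acid
dictionary (something like the forward_table of a Biopython codon table
could probably be passed in).

```python
def thread_codons(gapped_protein, nucleotides, gap="-", codon_table=None):
    """Thread an ungapped coding sequence onto a gapped protein (sketch).

    Hypothetical helper for illustration only; not Biopython API.
    """
    residues = len(gapped_protein) - gapped_protein.count(gap)
    # Assumption: exactly one codon per residue, no trailing stop codon.
    if len(nucleotides) != 3 * residues:
        raise ValueError(
            "Expected %i nucleotides for %i residues, got %i"
            % (3 * residues, residues, len(nucleotides))
        )
    codons = []
    pos = 0
    for aa in gapped_protein:
        if aa == gap:
            # One protein gap becomes three nucleotide gap characters.
            codons.append(gap * 3)
            continue
        codon = nucleotides[pos:pos + 3]
        pos += 3
        # Optional sanity check against a codon -> amino acid mapping.
        if codon_table is not None and codon_table.get(codon.upper()) != aa.upper():
            raise ValueError("Codon %s does not translate to %s" % (codon, aa))
        codons.append(codon)
    return "".join(codons)
```

For example, thread_codons("M-K", "ATGAAA") gives "ATG---AAA".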

There are potential complications with ID mapping between the proteins
and nucleotides (hence the option of a key function) and with the gap
characters (would you ever want to use different gap characters in the
protein and nucleotide alignments?). We could discuss implementation
details over on the biopython-dev list, but the general API discussion
might as well be here, e.g. where to put the function and what to call it.
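
As a rough illustration of the ID mapping question, the sketch below shows
how a key function might map each protein ID to the matching nucleotide ID
before threading. Again this is an assumption-laden example on plain strings,
not the implementation in the branch: back_translate_rows, its arguments and
the ID naming scheme in the usage note are all made up for illustration.

```python
def back_translate_rows(protein_rows, nucleotide_by_id, key_function=None, gap="-"):
    """Thread codons onto each row of a protein alignment (sketch).

    Hypothetical helper: protein_rows is a list of (id, gapped_protein)
    pairs and nucleotide_by_id maps IDs to ungapped coding sequences.
    """
    threaded = []
    for protein_id, gapped_protein in protein_rows:
        # The key function turns a protein ID into a nucleotide ID.
        nuc_id = key_function(protein_id) if key_function else protein_id
        if nuc_id not in nucleotide_by_id:
            raise KeyError("No nucleotide sequence for %r (from %r)" % (nuc_id, protein_id))
        nucleotides = nucleotide_by_id[nuc_id]
        residues = len(gapped_protein) - gapped_protein.count(gap)
        # Assumption: one codon per residue, no trailing stop codon.
        if len(nucleotides) != 3 * residues:
            raise ValueError("Length mismatch for %r" % nuc_id)
        row = []
        pos = 0
        for aa in gapped_protein:
            if aa == gap:
                row.append(gap * 3)
            else:
                row.append(nucleotides[pos:pos + 3])
                pos += 3
        threaded.append((nuc_id, "".join(row)))
    return threaded
```

With protein IDs like "p1" and nucleotide IDs like "n1", the key function
could be as simple as lambda pid: "n" + pid[1:]. Using a single gap argument
here sidesteps the question of separate protein and nucleotide gap
characters rather than answering it.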

Regards,

Peter


