[Biopython] Back translation support in Biopython

Sat Jul 21 17:44:40 EDT 2012


Hi Peter,
I would eliminate the ID mapping problem (or at least pass it on to the user) by exposing only the function that works on a single sequence pair. The other option is to check at run time, using a given genetic code, that each codon matches its amino acid. I did this in a program of mine that back-translated using only the aligned protein sequences and their UniProt/GI accession numbers (I ran the searches with Bio.Entrez). In my case, though, the nucleotide dictionary only held the different ways the nucleotide sequence could be imported from NCBI, each of which returned a different sequence.
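To make the run-time check concrete, here is a minimal sketch of a single-pair function. It is not the code from Peter's branch; the name thread_pair and its parameters are made up for illustration. To stay self-contained it builds the standard genetic code (NCBI table 1) by hand; inside Biopython the mapping would more likely come from Bio.Data.CodonTable. It also assumes the nucleotide sequence is unambiguous DNA and tolerates one trailing stop codon.

```python
from itertools import product

# Standard genetic code (NCBI table 1), built by hand so that the sketch
# runs without Biopython. Codons are ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def thread_pair(gapped_protein, nucleotides, table=STANDARD_TABLE, gap="-"):
    """Thread an ungapped coding sequence onto a gapped protein sequence.

    Hypothetical helper: each protein gap becomes three nucleotide gaps,
    and each residue is checked against the codon it consumes.
    """
    nucleotides = nucleotides.upper()
    pieces = []
    pos = 0  # offset of the next unused codon in the nucleotide sequence
    for aa in gapped_protein:
        if aa == gap:
            pieces.append(gap * 3)
            continue
        codon = nucleotides[pos:pos + 3]
        if len(codon) < 3:
            raise ValueError("nucleotide sequence is too short for the protein")
        # The run-time check: the codon must encode the aligned residue.
        if table.get(codon) != aa.upper():
            raise ValueError(
                f"codon {codon} at offset {pos} does not encode {aa}"
            )
        pieces.append(codon)
        pos += 3
    # Allow a single trailing stop codon, reject anything else left over.
    leftover = nucleotides[pos:]
    if leftover and table.get(leftover) != "*":
        raise ValueError("nucleotide sequence is longer than the protein")
    return "".join(pieces)
```

For example, thread_pair("M-K", "ATGAAA") returns "ATG---AAA", while thread_pair("MA", "ATGAAA") raises ValueError because AAA does not encode alanine.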
I can't see any need for different gap characters in the two alignments, and I think there could be both a Bio.SeqIO version of this function (working on a single pair of sequences) and a Bio.AlignIO version (working on multiple sequences, probably slower if it checks at run time).
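As a rough illustration of the multiple-sequence version, the sketch below threads every row of a protein alignment using a dictionary of coding sequences, with an optional key function for the ID mapping and one gap character shared by both alignments. The name thread_alignment and its arguments are hypothetical, and plain (id, sequence) tuples stand in for Biopython records; with Biopython the inputs could presumably come from Bio.AlignIO.read(...) and Bio.SeqIO.to_dict(...). The codon check is left out here to keep it short.

```python
def thread_alignment(proteins, nucleotides, key=None, gap="-"):
    """Thread coding sequences onto each row of a protein alignment.

    proteins    -- list of (protein_id, gapped_protein_string) pairs
    nucleotides -- dict mapping nucleotide ID to ungapped coding sequence
    key         -- optional function mapping a protein ID to a nucleotide ID
    """
    threaded = []
    for prot_id, gapped in proteins:
        nuc_id = key(prot_id) if key else prot_id
        if nuc_id not in nucleotides:
            raise KeyError(
                f"no nucleotide sequence for {prot_id!r} (looked up {nuc_id!r})"
            )
        cds = nucleotides[nuc_id]
        residues = len(gapped) - gapped.count(gap)
        if len(cds) < 3 * residues:
            raise ValueError(f"coding sequence for {prot_id!r} is too short")
        # Hand out one codon per residue and three gap characters per gap.
        codons = (cds[i:i + 3] for i in range(0, 3 * residues, 3))
        row = "".join(gap * 3 if aa == gap else next(codons) for aa in gapped)
        threaded.append((prot_id, row))
    return threaded


# Usage: protein IDs carry a database prefix that the nucleotide IDs lack.
protein_rows = [("sp|P1", "MK-"), ("sp|P2", "M-K")]
coding_seqs = {"P1": "ATGAAA", "P2": "ATGAAG"}
codon_rows = thread_alignment(
    protein_rows, coding_seqs, key=lambda name: name.split("|")[1]
)
```

Passing the ID mapping in as a function keeps that problem with the caller, which is the same idea as pushing it onto the user of the single-pair function.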

Regards,
Igor

> Date: Mon, 2 Jul 2012 12:27:08 +0100
> Subject: Re: [Biopython] Back translation support in Biopython
> From: p.j.a.cock at googlemail.com
> To: igorrcosta at hotmail.com; eric.talevich at gmail.com
> CC: biopython at lists.open-bio.org
> 
> On Wed, Apr 4, 2012 at 4:02 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Wed, Apr 4, 2012 at 2:49 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> >> Hi Igor,
> >>
> >> It sounds like you're referring to aligning amino acid sequences to codon
> >> sequences, as PAL2NAL does. This is different from what most people mean by
> >> back translation, but as you point out, certainly useful.
> >>
> >> If you write a function that can match a protein sequence alignment to a set
> >> of raw CDS sequences, returning a nucleotide alignment based on the
> >> codon-to-amino-acid mapping, that would be useful. However, PAL2NAL does
> >> exactly that, plus a bit more, and is a fairly well-known and easily
> >> obtained program. Personally, I would prefer to write a wrapper for PAL2NAL
> >> under Bio.Align.Applications, using the existing Bio.Applications framework.
> >
> > As per the old thread, a simple function in Python taking the gapped protein
> > sequence, original nucleotide coding sequence, and the translation table
> > does sound useful. Then using that, you could go from a protein alignment
> > plus the original nucleotide coding sequences to a codon alignment, or
> > other tasks. Given this is all relatively straightforward string manipulation
> > and we already have the required genetic code tables in Biopython, I'm not
> > convinced that wrapping PAL2NAL would be the best solution (for this sub
> > task).
> 
> Hi Igor,
> 
> Did you do any work on back-translation (alignment threading) in Biopython?
> 
> We needed to do this locally, and for some reason (yet to be determined)
> T-COFFEE wasn't working on our dataset, so I made a start at a Biopython
> implementation:
> 
> https://github.com/peterjc/biopython/tree/back_trans
> https://github.com/peterjc/biopython/commit/7d14cdb59bb9d41c727c923c8aa7e3dda7779c80
> 
> Currently just one commit adding a Bio.Align.alignment_back_translate(...)
> function which takes a protein alignment and dictionary of nucleotide
> records - easy to get with Bio.SeqIO and Bio.AlignIO - with a stand alone
> example included in the doctest. There is also a new (currently private)
> function to do this for one sequence pair - perhaps useful on its own?
> 
> There are potential complications with ID mapping between the proteins
> and nucleotides, thus the option of a key function, and the gap characters
> (would you ever want to use different gap characters in the protein and
> nucleotide alignments?). We could discuss implementation details over
> on the biopython-dev list, but the general API discussion might as well
> be here. e.g. Where to put the function and what to call it.
> 
> Regards,
> 
> Peter