[Biopython-dev] Codon Alignment GSoC Project Update

Wed Jul 17 19:36:30 UTC 2013

On Sun, Jul 14, 2013 at 9:19 PM, Zheng Ruan <zruan1991 at gmail.com> wrote:

> Hi all,
>
> I have an update of Codon Alignment project. It can be found at
> http://zruanweb.com/. My plan for the following three weeks is also
> there. Thanks!
>
> Best,
> Zheng Ruan
>

Hi Zheng,

Nice work. Regarding future plans:

- "Add Numpy slice for CodonAlignment" -- Peter voiced an interest in
optionally using Numpy arrays for multiple sequence alignments in general.
I suggest waiting to reach a consensus with Peter before implementing this
feature for CodonAlignment specifically.

- "Construct codon alignment based on tblastn result" -- tblastn is just a
heuristic for fast local alignment; instead, you can use dynamic
programming for pairwise alignments (e.g. Bio.pairwise2). You could
translate the nucleotide sequence in 3 frames, do local pairwise alignment
of the query protein sequence (ungapped) vs. each translated frame, then
stitch the alignments together as best you can. It might help to generate
lists of the offsets of each translated codon relative to the original
nucleotide sequence, e.g. range(0, 3*(N//3)+1, 3); range(1, 3*(N//3)+2, 3);
range(2, 3*(N//3)+3, 3). In this case the build() procedure has two
distinct phases: Align the protein sequence to the nucleotide sequence
optimally, then insert the gaps of the protein MSA into the codon sequences.

- In your Week 2 diary, you mentioned having a minimum score as an option
in the alignment function, but I don't see it in the code. I can think of a
few reasonable versions of this: for example, mismatch_count for the number
of codons that don't translate to the amino acid they're aligned to, and
untranslated_region_count for the number of skipped regions in the
nucleotide sequence (presumably introns or UTRs in the input, although who
knows what the user might want to do). If the user doesn't specify these,
the build() function should probably raise an error when such instances are
encountered, rather than defaulting to some value. Scoring in the style of
Exonerate seems unnecessarily open-ended.
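
To make the last two points concrete, here is a rough sketch in plain
Python (no Biopython). It is only an illustration under several
assumptions: exact substring search stands in for a real local pairwise
alignment, only the three forward frames are considered, and the function
names and the exact meaning of mismatch_count are hypothetical rather than
anything in the current code.

```python
# Standard genetic code, bases ordered T, C, A, G (NCBI table 1 layout).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_offsets(n, frame):
    """Start offsets of the complete codons in reading frame 0, 1 or 2."""
    return list(range(frame, n - 2, 3))


def translate_frame(nucleotide, frame):
    """Translate one forward frame; return (protein, codon start offsets)."""
    offsets = codon_offsets(len(nucleotide), frame)
    protein = "".join(
        CODON_TABLE.get(nucleotide[i:i + 3].upper(), "X") for i in offsets
    )
    return protein, offsets


def locate_protein(protein, nucleotide):
    """Return (frame, nucleotide offset) of the first exact match, or None.

    Exact substring search is only a stand-in for local pairwise alignment.
    """
    if not protein:
        return None
    for frame in range(3):
        translated, offsets = translate_frame(nucleotide, frame)
        hit = translated.find(protein)
        if hit != -1:
            # offsets[hit] maps the protein position back to the nucleotides.
            return frame, offsets[hit]
    return None


def count_mismatches(protein, nucleotide, start, mismatch_count=None):
    """Count codons from `start` that do not translate to the aligned residue.

    Hypothetical policy: with mismatch_count=None any mismatch is an error;
    otherwise up to mismatch_count mismatches are tolerated.
    """
    mismatches = 0
    for k, residue in enumerate(protein):
        codon = nucleotide[start + 3 * k:start + 3 * k + 3].upper()
        if CODON_TABLE.get(codon, "X") != residue:
            mismatches += 1
    allowed = 0 if mismatch_count is None else mismatch_count
    if mismatches > allowed:
        raise ValueError(
            "%d mismatched codon(s), %d allowed" % (mismatches, allowed)
        )
    return mismatches
```

For example, locate_protein("MAK", "GATGGCCAAATAG") finds the protein in
frame 1 starting at nucleotide offset 1, and count_mismatches() can then
check each codon against its residue. Note that codon_offsets() lists only
complete codons, so it stops before any trailing partial codon.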

In your GSoC application, you mentioned a published method for alignment
that might be relevant. Did you determine that it wouldn't work here?
Also see Exonerate (http://www.biomedcentral.com/1471-2105/6/31), as
its protein2genome alignment procedure does something similar to what
you're attempting.

Cheers,
Eric