[Biopython-dev] Codon Alignment GSoC Project Update

Thu Jul 18 10:07:31 UTC 2013

Hi Eric,

Thanks for the feedback. I finished implementing backward frameshift today.
There are a few things I need to mention here.

For the future plan.
There is already a Numpy style slice implemented in MultipleSeqAlignment. I
don't know why it doesn't work in CodonAlignment, as the __getitem__ method
in CodonAlignment is directly from MultipleSeqAlignment. But this is a
small issue and will be fixed soon.

I'm thinking about using tblastx output because it might helpful to deal
with forward frameshift (gaps) in translation. But it doesn't help when
there are backward frameshift, because some nucleotide will be used twice
in a single translation (None of the alignment methods can detect this as
far as I know). Actually, the underlying algorithm in Bio.CodonAlign.build
is almost the same as pal2nal and is just opposite of what you described.
Instead of translating nucleotide sequence in three reading frames, I back
translate protein sequences into degenerated codon regular expression. And
try to find a match between the translated re and given nucleotide
sequence. Frameshift detection from scratch is difficult bacause of my
method is based on regular expression and the search step is rather simple.
pal2nal can deal with user specified frameshift but doesn't try to find it
from raw sequences. My current code can handle up to 10 forward frameshift
(gaps) but only 1 backward frameshift in the sequence. I will add support
for multiple backward frameshift support in the future. However, I don't
anticipate this function to be able to handle all the situations. For
example, if two frameshift events happen too close together, it is really
hard to figure this out. Is it very often to see such frameshift in real
biological world? Another question is about the actual usage of shifted
alignment. I am unaware of any statistical methods that account for this.
Normally, when people know there is a frameshift, they probably already
figured out where it happens. Therefore, it's better to ask the user to
tell the program where the frameshift lies. Functions to facilitate this
step will be added.

A scoring scheme is pending since the score need to account for all
situations (mismatches, frameshift). I will add it when all the functions
are tested correct. It is necessary because the mechanism for mismatch
detection is very robust! It can align protein sequence to nucleotide
sequence without any relationship in theory. Therefore a maximum tolerance
should be set.

As for the MACSE, it employs a totally different strategy and is not
optimal. I will have a look at Exonerate and protein2genome procedure.
Thanks!

Best,
Zheng Ruan

On Thu, Jul 18, 2013 at 3:36 AM, Eric Talevich <eric.talevich at gmail.com>wrote:

> On Sun, Jul 14, 2013 at 9:19 PM, Zheng Ruan <zruan1991 at gmail.com> wrote:
>
>> Hi all,
>>
>> I have an update of Codon Alignment project. It can be found at
>> http://zruanweb.com/. My plan for the following three weeks is also
>> there. Thanks!
>>
>> Best,
>> Zheng Ruan
>>
>
> Hi Zheng,
>
> Nice work. Regarding future plans:
>
> - "Add Numpy slice for CodonAlignment" -- Peter voiced an interested in
> optionally using Numpy arrays for multiple sequence alignments in general.
> I suggest waiting to reach a consensus with Peter before implementing this
> feature for CodonAlignment specifically.
>
> - "Construct codon alignment based on tblastn result" -- tblastn is just a
> heuristic for fast local alignment; instead, you can use dynamic
> programming for pairwise alignments (e.g. Bio.pairwise2). You could
> translate the nucleotide sequence in 3 frames, do local pairwise alignment
> of the query protein sequence (ungapped) vs. each translated frame, then
> stitch the alignments together as best you can. It might help to generate
> lists of the offsets of each translated codon relative to the original
> nucleotide sequence, e.g. range(0, 3*(N//3)+1, 3); range(1, 3*(N//3)+2, 3);
> range(2, 3*(N//3)+3, 3). In this case the build() procedure has two
> distinct phases: Align the protein sequence to the nucleotide sequence
> optimally, then insert the gaps of the protein MSA into the codon sequences.
> - In your Week 2 diary, you mentioned having a minimum score as an option
> in the alignment function, but I don't see it in the code. I can think of a
> few reasonable versions of this. Reasonable options might be mismatch_count
> and untranslated_region_count for the number of codons that don't translate
> to the amino acid they're aligned to, and the number of skipped regions in
> the nucleotide sequence (presumably introns or UTRs in the input, although
> who knows what the user might want to do). If not specified by the user,
> the build() function should probably throw an error if those instances are
> encountered, rather than defaulting to some value. Scoring in the style of
> Exonerate seems unnecessarily open-ended.
>
> In your GSoC application, you mentioned a published method for alignment
> that might be relevant here. Did you determine that it wouldn't work here?
> Also see the Exonerate (http://www.biomedcentral.com/1471-2105/6/31), as
> their protein2genome alignment procedure does something similar to what
> you're attempting.
>
> Cheers,
> Eric
>