[Biopython-dev] [Biopython] Exonerate Parser Error

Fri Mar 21 15:32:33 EDT 2014

Hi Bow,

I have the same problem when trying to model codon alignment with
frameshift being considered. Basically, I have a CodonSeq object to store a
coding sequence. The only difference between CodonSeq and Seq object is
that CodonSeq has an attribute -- `rf_table` (reading frame table). It's
actually a list of positions each codon starts with, so that translate()
method will go through the list to translate codon into amino acid. In this
case, it is easy to store a coding sequence with frameshift events. And
it's not necessary to split the protein to dna alignment into multiple part
when frameshift occurs. However, the problem now becomes how to obtain such
information (`rf_table`). I find exonerate is quite capable of handling
this task, especially with introns in the dna. I do think an object to
store protein to dna alignment is necessary in this scenario.

Best,
Zheng

On Fri, Mar 21, 2014 at 10:59 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com>wrote:

> Hi Zheng,
>
> The nucleotide information is stored as the alignment annotation. You
> can access it using hsp.aln_annotation['query_annotation']. There,
> they are stored as triplets, reprensenting the codons.
>
> This is indeed a tradeoff that I had to make because there is no
> proper model yet to represent alignment objects containing sequences
> with different length in our master branch. In this case, the length
> of the DNA is most of
> the time 3x the length of the protein. And yes, this is not ideal
> since the actual query are now stored as an annotation ~ trading
> places with the translated query. HSPs themselves are basically
> modelled based on our MultipleSeqAlignment objects (you can get such
> objects when accessing the `aln` attribute from an HSP object). I
> think in order to properly model these types of alignment, we need to
> have a proper model of three-letter protein Seq objects as well.
>
> Your CodonSeqAlignment object may help here :), but I have not looked
> into it that much to be honest. How does it work with Seq objects with
> ProteinAlphabet? Is it possible to align protein and codon sequences?
>
> I tried storing as much information as possible using the current
> approach (e.g. notice the start and end coordinates of each hit and
> query, they are parsed from the file and the difference is not the
> same as the value you get when doing a `len` on hsp.query and/or
> hsp.hit). Note also that when dealing with frameshifts, you may want
> to access the hsp.fragments attribute, since frameshifts mean that you
> can break further your HSP alignment into multiple subalignments
> (fragments as it is called in SearchIO).
>
> Hope this helps :),
> Bow
>
> P.S. Also CC-ing the Development list ~ this looks like something
> interesting for dev in general.
>
> On Fri, Mar 21, 2014 at 3:39 PM, Zheng Ruan <zruan1991 at gmail.com> wrote:
> > Thanks Bow,
> >
> > That works for me. But it seems the parser doesn't take the nucleotide
> > information into the hsps. All I get is a pairwise alignment between two
> > proteins. Nucleotide information is useful because I want to know the
> codon
> > -- amino acid correspondence. In the case of frameshift the situation may
> > not be that straightforward. Maybe you have other concern of not doing
> this.
> >
> > Best,
> > Zheng
> >
> >
> > On Thu, Mar 20, 2014 at 7:30 PM, Wibowo Arindrarto <
> w.arindrarto at gmail.com>
> > wrote:
> >>
> >> Hi Zheng,
> >>
> >> Thank you for the files :). I found out what was causing the error and
> >> have pushed a patch along with some tests to our codebase
> >>
> >> (
> https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d
> ).
> >> You should be able to parse your file using the latest `master`
> >> branch.
> >>
> >> Hope this helps,
> >> Bow
> >>
> >> On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan <zruan1991 at gmail.com>
> wrote:
> >> > Hi Bow,
> >> >
> >> > I'm happy to provide the example for testing. See attachment.
> >> >
> >> > The command to generate the output above.
> >> > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa
> >> >
> >> > I'll check the test suite to see if I can find why.
> >> >
> >> > Best,
> >> > Zheng
> >> >
> >> >
> >> > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto
> >> > <w.arindrarto at gmail.com>
> >> > wrote:
> >> >>
> >> >> > Looking at our test cases, this particular case may have slipped
> >> >> > testing. We do test for several cases of dna2protein (which could
> >> >> > explain why it works when the nucleotide sequence comes first), but
> >> >> > not protein2dna. Please let me know if I can also use your example
> as
> >> >> > a test in our test corpus :).
> >> >>
> >> >> Oops, I meant the reverse ~ we have several test cases for
> protein2dna
> >> >> which may explain why it works when the protein sequence comes first
> >> >> ;).
> >> >
> >> >
> >
> >
>