[Biopython-dev] [Biopython] Exonerate Parser Error

Fri Mar 21 10:59:40 EDT 2014

Hi Zheng,

The nucleotide information is stored as the alignment annotation. You
can access it using hsp.aln_annotation['query_annotation']. There,
they are stored as triplets, reprensenting the codons.

This is indeed a tradeoff that I had to make because there is no
proper model yet to represent alignment objects containing sequences
with different length in our master branch. In this case, the length
of the DNA is most of
the time 3x the length of the protein. And yes, this is not ideal
since the actual query are now stored as an annotation ~ trading
places with the translated query. HSPs themselves are basically
modelled based on our MultipleSeqAlignment objects (you can get such
objects when accessing the `aln` attribute from an HSP object). I
think in order to properly model these types of alignment, we need to
have a proper model of three-letter protein Seq objects as well.

Your CodonSeqAlignment object may help here :), but I have not looked
into it that much to be honest. How does it work with Seq objects with
ProteinAlphabet? Is it possible to align protein and codon sequences?

I tried storing as much information as possible using the current
approach (e.g. notice the start and end coordinates of each hit and
query, they are parsed from the file and the difference is not the
same as the value you get when doing a `len` on hsp.query and/or
hsp.hit). Note also that when dealing with frameshifts, you may want
to access the hsp.fragments attribute, since frameshifts mean that you
can break further your HSP alignment into multiple subalignments
(fragments as it is called in SearchIO).

Hope this helps :),
Bow

P.S. Also CC-ing the Development list ~ this looks like something
interesting for dev in general.

On Fri, Mar 21, 2014 at 3:39 PM, Zheng Ruan <zruan1991 at gmail.com> wrote:
> Thanks Bow,
>
> That works for me. But it seems the parser doesn't take the nucleotide
> information into the hsps. All I get is a pairwise alignment between two
> proteins. Nucleotide information is useful because I want to know the codon
> -- amino acid correspondence. In the case of frameshift the situation may
> not be that straightforward. Maybe you have other concern of not doing this.
>
> Best,
> Zheng
>
>
> On Thu, Mar 20, 2014 at 7:30 PM, Wibowo Arindrarto <w.arindrarto at gmail.com>
> wrote:
>>
>> Hi Zheng,
>>
>> Thank you for the files :). I found out what was causing the error and
>> have pushed a patch along with some tests to our codebase
>>
>> (https://github.com/biopython/biopython/commit/377889b05235c2e6f192916fb610d0da01b45c6d).
>> You should be able to parse your file using the latest `master`
>> branch.
>>
>> Hope this helps,
>> Bow
>>
>> On Thu, Mar 20, 2014 at 9:42 PM, Zheng Ruan <zruan1991 at gmail.com> wrote:
>> > Hi Bow,
>> >
>> > I'm happy to provide the example for testing. See attachment.
>> >
>> > The command to generate the output above.
>> > exonerate --showvulgar no --showalignment yes nuc.fa pro.fa
>> >
>> > I'll check the test suite to see if I can find why.
>> >
>> > Best,
>> > Zheng
>> >
>> >
>> > On Thu, Mar 20, 2014 at 4:33 PM, Wibowo Arindrarto
>> > <w.arindrarto at gmail.com>
>> > wrote:
>> >>
>> >> > Looking at our test cases, this particular case may have slipped
>> >> > testing. We do test for several cases of dna2protein (which could
>> >> > explain why it works when the nucleotide sequence comes first), but
>> >> > not protein2dna. Please let me know if I can also use your example as
>> >> > a test in our test corpus :).
>> >>
>> >> Oops, I meant the reverse ~ we have several test cases for protein2dna
>> >> which may explain why it works when the protein sequence comes first
>> >> ;).
>> >
>> >
>
>