[Biopython] Start positions for local pairwise alignments?

Jan T Kim jttkim at googlemail.com
Fri Sep 7 08:51:36 UTC 2012


On Thu, Sep 06, 2012 at 01:01:46AM +0100, Peter Cock wrote:
> On Mon, Sep 3, 2012 at 10:31 AM, Jan T Kim <jttkim at googlemail.com> wrote:
> > Dear All,
> >
> > after reading a pairwise alignment computed using the EMBOSS water
> > program, is it possible to find out the indices of the sequences in
> > the local alignment within the input sequences?
> >
> > ...
> >
> > I can't seem to find that information anywhere either in the resulting
> > Bio.Align.MultipleSeqAlignment object, or in the SeqRecord objects
> > that it contains.
> >
> > So, am I looking at the wrong place?
> 
> No, these numbers are not currently being parsed. This applies to
> some of the other file formats in AlignIO too, because we (still)
> don't have an agreed way to store this in our object model.

OK, thanks for clarifying. I think I understand now; I wasn't sure whether
to expect that information in the Seq, the SeqRecord or the
MultipleSeqAlignment objects.

For what it's worth, the most suitable solution currently seems to me
to be a (say) AlignedSeq subclass of Seq that provides a couple of
optional additional instance variables, such as the start index
of the aligned sequence within the input sequence.

I'd envision this information to be optional in the sense that the
instance variable would be None if the start position is not
available, which would obviously be the case for some alignment
formats (for most multiple alignments, in fact).
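
As a minimal sketch of this idea (not an existing Biopython API): the
class name AlignedSeq and the attributes start and end are hypothetical,
and a plain str subclass stands in for Bio.Seq.Seq so that the example
runs without Biopython installed. It assumes 0-based start indices and
"-" as the gap character.

```python
from typing import Optional


class AlignedSeq(str):
    """Hypothetical aligned sequence carrying its optional start index.

    A str subclass is used here only as a stand-in for Bio.Seq.Seq.
    """

    def __new__(cls, data: str, start: Optional[int] = None):
        obj = super().__new__(cls, data)
        # 0-based index of the first aligned residue within the input
        # sequence, or None if the alignment format does not provide it.
        obj.start = start
        return obj

    @property
    def end(self) -> Optional[int]:
        """Exclusive end index in the input sequence, or None if unknown."""
        if self.start is None:
            return None
        # Gap characters do not consume positions of the input sequence.
        return self.start + len(self.replace("-", ""))


local = AlignedSeq("AC-GT", start=3)   # e.g. from a local alignment
unknown = AlignedSeq("ACGT")           # format without coordinates
```

Code that does not care about coordinates can keep treating such an
object as an ordinary sequence, while code that needs them can test
for None first.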

> Last time I used this parser, I was probably using needle rather
> than water, where these are global alignments so you don't need
> the start/end values.

Incidentally, I initially used needle as well, but then got additional
data which contained elevated levels of "junk", which required a
switch to local alignments. In this case there was a region of
interest with a subsequence that was unique, so I could figure out
whether the region of interest was aligned or not, but that approach
can be unreliable when repetitive regions are involved and/or
definitions of the "region of interest" are subject to shifts. So
I'd think having the start index, where available, would be useful
in the long run.
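
To illustrate why a search-based workaround is fragile, here is a
minimal sketch that tries to recover the start index by looking up the
ungapped aligned fragment in the input sequence. The function name
locate_fragment is hypothetical, plain strings are assumed rather than
Biopython objects, and "-" is assumed to be the gap character. The
function returns None when the fragment is absent or occurs more than
once, which is exactly the repetitive-region case.

```python
from typing import Optional


def locate_fragment(full_seq: str, aligned: str) -> Optional[int]:
    """Return the 0-based start of the aligned fragment in full_seq.

    Returns None if the fragment is not found or is not unique.
    """
    fragment = aligned.replace("-", "")  # drop alignment gaps
    first = full_seq.find(fragment)
    if first == -1:
        return None  # fragment does not occur in the input sequence
    if full_seq.find(fragment, first + 1) != -1:
        return None  # ambiguous: fragment occurs more than once
    return first


unique_start = locate_fragment("TTTACGTGGG", "AC-GT")
repeat_start = locate_fragment("ACGTACGT", "ACGT")
```

A start index reported by the alignment program itself would not have
this ambiguity.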

Best regards & have a nice weekend all, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*


