[Biopython-dev] Alignment columns as strings or Seq objects?

Thu May 13 11:47:48 UTC 2010

Peter wrote:
> Hello all,
>
> Are there any outstanding issues we should address before making
> the Biopython 1.54 release?
>
> ...
>
> One thing I am wondering about is making column extraction in
> the new alignment object return a string rather than a Seq object.
> I'll start another thread on this issue...

I remember we debated this a bit before but can't find the
thread right now. See also Bug 3066, where I propose adding
methods to iterate over the rows or columns as strings.
http://bugzilla.open-bio.org/show_bug.cgi?id=3066

The main benefit of using a plain string when extracting the
alignment columns is speed. Because the data is stored by
row, each time we extract a column we would have to build
a new instance of the Seq object. For large alignments (and
thinking ahead to next-gen alignment objects) this could be
a painful overhead.
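
As a rough illustration of why column access costs more than row
access when the data is stored by row, here is a plain-Python
sketch. It is not Biopython code: rows, column_as_string and
iter_columns are names made up for this example, and the
alignment is just a list of equal-length strings.

```python
# Toy alignment stored by row, as plain strings of equal length.
rows = ["ACGT-A", "ACGTTA", "AC-TTA"]


def column_as_string(rows, index):
    """Build one column by taking the same position from every row."""
    return "".join(row[index] for row in rows)


def iter_columns(rows):
    """Yield each column in turn as a plain string."""
    for letters in zip(*rows):
        yield "".join(letters)


third_column = column_as_string(rows, 2)
all_columns = list(iter_columns(rows))
```

Every column has to be assembled from all the rows, so anything
wrapped around the resulting string is extra work per column.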

Because the whole alignment has an alphabet, we can use it to
assign an alphabet to a column. The rows of the alignment could
have slightly different alphabets, but the alignment-level
alphabet still lets us generate a Seq object with a meaningful
alphabet from a column (and the current code does this).

Why would a Seq object be useful here? Other than the alphabet,
the main benefit of using a Seq object is consistency. On a
practical level, the Seq object's biological translate method
isn't appropriate at all for an alignment column. On the other
hand, one might possibly want to use (back)transcribe to flip
between DNA and RNA, and maybe even take the complement.
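
For comparison, the same operations are not hard on a plain
string column. This is only a sketch under some assumptions:
Python 3, unambiguous DNA letters, and two helper names
(complement_column, transcribe_column) invented for this
example rather than taken from Biopython.

```python
# Hypothetical helpers for a column held as a plain string.
# Only unambiguous DNA letters are mapped; gaps and any other
# symbols pass through unchanged.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_column(column):
    """Return the complement of each DNA letter in the column."""
    return column.translate(_DNA_COMPLEMENT)


def transcribe_column(column):
    """Swap T for U (DNA to RNA), preserving case."""
    return column.replace("T", "U").replace("t", "u")


complemented = complement_column("GG-")
transcribed = transcribe_column("-TT")
```

What a plain string cannot do is carry the alphabet, so the
caller has to know the column is DNA before calling either
helper.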

Are there any strong views here on how slicing an alignment to
get a column should behave? That is, should align[:,9] return
the column as a string or as a Seq?
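
To make the string option concrete, here is a toy class showing
how the align[:, col] form could hand back a plain string.
ToyAlignment is hypothetical and is not the Biopython alignment
class; it only handles this one form of indexing.

```python
class ToyAlignment:
    """Hypothetical row-stored alignment; not the Biopython class."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, index):
        # Only the align[:, col] form is handled in this sketch.
        row_part, col = index
        if row_part != slice(None) or not isinstance(col, int):
            raise TypeError("this sketch only supports align[:, int]")
        # Return the column as a plain string, built from every row.
        return "".join(row[col] for row in self._rows)


align = ToyAlignment(["ACGT-A", "ACGTTA", "AC-TTA"])
column = align[:, 4]
```

Returning a Seq instead would mean wrapping that joined string,
plus an alphabet, in a new object on every column access.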

Peter