[Biopython-dev] Alignment columns as strings or Seq objects?
Michiel de Hoon
mjldehoon at yahoo.com
Thu May 13 20:29:59 EDT 2010
I would definitely use a plain string. A Seq object suggests that we're dealing with a real biological sequence, which a column in the alignment matrix is not. The only advantage of having a Seq object is that it has an alphabet associated with it. But alphabets are very rarely used in practice, if at all. Reverse complementing or (back-)transcribing are available in the Bio.Seq module as functions that can operate on plain strings, so we don't need a Seq object for that.
--Michiel.
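
To make the plain-string option concrete, here is a minimal sketch using only the standard library. It is not Biopython's implementation: the toy `rows` data and the helpers `get_column`, `complement` and `transcribe` are hypothetical names invented for illustration. In real code the Bio.Seq module-level functions mentioned above would do the transcribing and complementing.

```python
# Hypothetical toy alignment, stored by row as plain strings.
rows = ["ACGT-A", "ACGTTA", "AC-TTA"]


def get_column(rows, index):
    """Return one alignment column as a plain string (no Seq wrapper)."""
    return "".join(row[index] for row in rows)


# Stand-ins for string-based helpers; these are assumptions for the
# sketch, not the Bio.Seq functions themselves.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(text):
    """Complement DNA letters; gaps and other characters pass through."""
    return text.translate(_COMPLEMENT)


def transcribe(text):
    """Turn DNA letters into RNA letters by replacing T with U."""
    return text.replace("T", "U").replace("t", "u")


column = get_column(rows, 4)
print(column, complement(column), transcribe(column))
```

Because the column is an ordinary str, no per-column object has to be constructed, and the usual string methods (count, join, slicing) apply directly.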
--- On Thu, 5/13/10, Peter <biopython at maubp.freeserve.co.uk> wrote:
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Alignment columns as strings or Seq objects?
> To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Thursday, May 13, 2010, 7:47 AM
> Peter wrote:
> > Hello all,
> >
> > Are there any outstanding issues we should address before making
> > the Biopython 1.54 release?
> >
> > ...
> >
> > One thing I am wondering about is making column extraction in
> > the new alignment object return a string rather than a Seq object.
> > I'll start another thread on this issue...
>
> I remember we debated this a bit before but can't find the
> thread right now. See also Bug 3066 where I am proposing
> to add methods to iterate over the rows or columns as
> strings.
> http://bugzilla.open-bio.org/show_bug.cgi?id=3066
>
> The main benefit of using a plain string when extracting the
> alignment columns is speed. Because the data is stored by
> row, each time we extract a column we would have to build
> a new instance of the Seq object. For large alignments (and
> thinking ahead to next-gen alignment objects) this could be
> a painful overhead.
>
> Because the whole alignment has an alphabet, we can use this
> to assign an alphabet to a column sequence. Note that the rows
> of the alignments could have slightly different alphabets. So it
> is possible (and the current code does this) to generate a Seq
> object with a meaningful alphabet from a column.
>
> Why is this useful? Other than the alphabet, the main benefit
> of using a Seq object is consistency. On a practical level, the
> Seq object's biological translate method isn't appropriate at all
> for an alignment column. On the other hand, one might possibly
> want to use (back)transcribe to flip between DNA and RNA,
> and maybe even take the complement.
>
> Are there any strong views here on how alignment slicing to
> get a column should behave? i.e. should align[:,9] return the
> column as a string or as a Seq?
>
> Peter