[Biopython-dev] Improving the Alignment object
Peter
biopython-dev at maubp.freeserve.co.uk
Fri Jul 27 13:11:03 EDT 2007
Jan Kosinski wrote:
> We had another discussion in the lab about whether the Alignment object
> should store records not in a list but rather in a dictionary (while
> keeping information about sequence order) or similar. What is your
> reasoning for making the Alignment object a list of SeqRecord objects?
In a sense the Bio.Align.Generic.Alignment object always was a list of
SeqRecords (if you look at the internal implementation, that is), and I
hadn't stopped to really question it. I like having list-like behaviour
and exploit this in a lot of my code dealing with alignments.
There are some nice things about having dictionary-like behaviour in an
alignment class, but unless a notional sequence order is preserved, this
breaks the array-of-characters model.
Also, using a dictionary-like alignment would force the user to specify
unique keys for each record (e.g. the record.id), which is something the
current list-like alignment does not require.
Perhaps we could have a "dictionary-like" subclass of Alignment where
the __getitem__ method would allow a record identifier in place of a row
index:
print aln["P3454"]
print aln["P3454", 20]
instead of, or as well as:
print aln[10]
print aln[10, 20]
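To make the idea concrete, here is a minimal sketch of such a lookup. It is not the Bio.Align.Generic code: the class name KeyedAlignment is hypothetical, rows are plain (identifier, sequence string) pairs standing in for SeqRecord objects, and it is written in Python 3 syntax (so print is a function). Taking the first match when identifiers repeat is an assumption, not something settled in this thread.

```python
class KeyedAlignment:
    """Hypothetical sketch: an ordered list of (identifier, sequence) rows."""

    def __init__(self, rows):
        self._rows = list(rows)  # a list keeps the row order

    def _row_index(self, key):
        # Accept either an integer row index or a record identifier.
        if isinstance(key, int):
            return key
        for i, (ident, _seq) in enumerate(self._rows):
            if ident == key:
                return i  # assumption: first match wins if identifiers repeat
        raise KeyError(key)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # aln[row, column] returns a single letter
            row, col = key
            return self._rows[self._row_index(row)][1][col]
        # aln[row] returns the whole row
        return self._rows[self._row_index(key)]


aln = KeyedAlignment([("P1234", "ACDEF"), ("P3454", "ACD-F")])
print(aln["P3454"])     # row looked up by identifier
print(aln["P3454", 3])  # one letter, by identifier and column
print(aln[1, 3])        # the same letter, by row index and column
```

Because the rows still live in a list, the notional sequence order is kept and unique identifiers are only needed if you choose to look rows up by name.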
> One should carefully think about design of the Alignment class since it
> will influence all further steps. As now the class is in its infancy
> there is a very good moment for thinking what the Alignment class is for
> and what it should support.
I had viewed the new __getitem__ method as a backwards compatible
enhancement of the existing stable (but rather limited)
Bio.Align.Generic.Alignment class. That's not to say we can't design a
new class from scratch - I just prefer gradual improvements without
breaking existing usage.
I am particularly keen to allow slicing of alignments. For example, you
could select the conserved core of an alignment by removing the
leftmost 10 columns and the rightmost 10 columns:
align_core = aln[:,10:-10]
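A rough sketch of how a two-part index like this could be handled is below. The class name SliceableAlignment is hypothetical, rows are (identifier, sequence string) pairs rather than SeqRecord objects, and the return types chosen for each index combination are assumptions for illustration only. The toy alignment is short, so it trims two columns from each side rather than ten.

```python
class SliceableAlignment:
    """Hypothetical sketch of aln[rows, columns] indexing."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            return self._rows[key]  # plain list-like row access
        row_key, col_key = key
        if isinstance(row_key, int):
            # One row: a single letter or a sub-sequence of that row
            _ident, seq = self._rows[row_key]
            return seq[col_key]
        selected = self._rows[row_key]  # row_key is a slice here
        if isinstance(col_key, int):
            # One column across the selected rows, returned as a string
            return "".join(seq[col_key] for _ident, seq in selected)
        # Row slice and column slice: a new, smaller alignment
        return SliceableAlignment(
            (ident, seq[col_key]) for ident, seq in selected
        )


aln = SliceableAlignment([("a", "--ACGT--"), ("b", "TTACGAGG")])
core = aln[:, 2:-2]  # drop the two leftmost and two rightmost columns
print(core[0])
print(core[1])
```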
> For instance, the Alignment object should
> support changing characters in the alignment without a need of copying
> it (using aln[a,x] = "D"). Can it be done now with Alignment which is
> a list of SeqRecord objects with sequences implemented as immutable Seq
> objects ?
No, right now you can't easily edit sequences in a
Bio.Align.Generic.Alignment (even with the proposed change) as it is
implemented using immutable Seq objects. I personally haven't needed to
edit an alignment like this. Is this something you want to do often?
To me the obvious way to handle this is to have a MutableAlignment
sub-class, where editing individual elements with aln[r,c] = "D" would
be supported (possibly implemented using the MutableSeq class internally
rather than the immutable Seq class).
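As an illustration only, in-place editing could look something like the sketch below. The class name MutableAlignmentSketch is hypothetical, and each row is held as a plain Python list of letters standing in for a MutableSeq; restricting assignment to a single letter is an assumption made here so that the alignment stays rectangular.

```python
class MutableAlignmentSketch:
    """Hypothetical sketch: rows held as lists of letters so they can be edited."""

    def __init__(self, rows):
        rows = list(rows)
        self._ids = [ident for ident, _seq in rows]
        # A list of characters stands in for a MutableSeq here.
        self._seqs = [list(seq) for _ident, seq in rows]

    def __getitem__(self, key):
        row, col = key
        return self._seqs[row][col]

    def __setitem__(self, key, letter):
        row, col = key
        if len(letter) != 1:
            # Assumption: only single letters, so every row keeps its length
            raise ValueError("expected a single letter")
        self._seqs[row][col] = letter  # edited in place, no copy of the row

    def row(self, index):
        return self._ids[index], "".join(self._seqs[index])


aln = MutableAlignmentSketch([("a", "ACDEF"), ("b", "ACEEF")])
aln[1, 2] = "D"
print(aln.row(1))
```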
On a related point, I was planning to raise the following suggestion in
the future - adding alignments, like this:
combined_aln = aln1 + aln2
For example, if aln1 had 5 rows of length 10 and aln2 had 5 rows of
length 15, then the result of aln1+aln2 would have 5 rows of length 25.
Alignment addition would only be defined for alignments with the same
number of rows (perhaps also restricted to the same sequence type, and
row weights?). The result would contain the same number of rows, where
each sequence was the concatenation of the corresponding two rows in the
input alignments. I'd suggest concatenating the record.id values (if
different); however, one could argue that it would be better to insist
the user had made sure the two alignments had consistent identifiers.
An example of where this could be used is taking alignments of multiple
sets of homologous genes, sorting them to use the same species order,
and then creating a concatenated alignment for robust phylogenetic tree
construction.
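The addition rule and the phylogenetics use case could be sketched roughly as follows. Everything here is hypothetical: the class name AddableAlignment, the sorted_by_id helper, and the "+" separator used when two identifiers differ are assumptions for illustration, and rows are (identifier, sequence string) pairs rather than SeqRecord objects. The checks on sequence type and row weights mentioned above are left out.

```python
class AddableAlignment:
    """Hypothetical sketch of row-wise alignment concatenation."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sorted_by_id(self):
        # Put rows into a common order (e.g. by species name) before adding.
        return AddableAlignment(sorted(self.rows))

    def __add__(self, other):
        if not isinstance(other, AddableAlignment):
            return NotImplemented
        if len(self.rows) != len(other.rows):
            raise ValueError("alignments must have the same number of rows")
        combined = []
        for (id1, seq1), (id2, seq2) in zip(self.rows, other.rows):
            # Assumption: keep a shared identifier, else join the two with "+"
            new_id = id1 if id1 == id2 else id1 + "+" + id2
            combined.append((new_id, seq1 + seq2))
        return AddableAlignment(combined)


species = ["sp1", "sp2", "sp3", "sp4", "sp5"]
gene1 = AddableAlignment((name, "A" * 10) for name in species)
gene2 = AddableAlignment((name, "C" * 15) for name in reversed(species))

# Sort both into the same species order, then concatenate row by row.
combined = gene1.sorted_by_id() + gene2.sorted_by_id()
print(len(combined.rows), len(combined.rows[0][1]))
```

Without the sorting step the identifiers in each row pair would differ, which is exactly the case where one might prefer to raise an error instead of joining them.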
Peter