[Biopython-dev] Improving the Alignment object

Fri Jul 27 13:11:03 EDT 2007

Jan Kosinski wrote:
> We had another discussion in the lab about the idea that the Alignment 
> object should store records not in a list but rather in a dictionary 
> (while keeping information about sequence order) or similar.  What is your 
> reasoning for making the Alignment object a list of SeqRecord objects?

In a sense the Bio.Align.Generic.Alignment object always was a list of 
SeqRecords (if you look at the internal implementation that is), and I 
hadn't stopped to really question it. I like having list-like behaviour, 
and exploit this in a lot of my code dealing with alignments.

There are some nice things about having dictionary-like behaviour in an 
alignment class, but unless a notional sequence order is preserved, this 
breaks the array of characters model.

Also, using a dictionary-like alignment would force the user to specify 
unique keys for each record (e.g. the record.id), which is something the 
current list-like alignment does not require.

Perhaps we could have a "dictionary-like" sub-class of Alignment where 
the __getitem__ method would allow a record identifier in place of a row 
index:

print aln["P3454"]
print aln["P3454", 20]

instead of, or as well as:

print aln[10]
print aln[10, 20]
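
A minimal sketch of how such a lookup could work, using a toy stand-in class rather than the real Alignment object. The class name DictLikeAlignment, the (identifier, sequence string) row format and the "first match wins" rule for duplicate identifiers are all assumptions made for illustration:

```python
class DictLikeAlignment(object):
    """Toy alignment: an ordered list of (identifier, sequence string) rows."""

    def __init__(self, rows):
        self._rows = list(rows)  # row order is preserved, as in a list

    def _row_index(self, key):
        # Accept either an integer row index or a record identifier.
        if isinstance(key, int):
            return key
        for index, (ident, _) in enumerate(self._rows):
            if ident == key:
                return index  # assumption: the first matching identifier wins
        raise KeyError(key)

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # aln[row, column] returns a single character
            row, col = key
            return self._rows[self._row_index(row)][1][col]
        # aln[row] returns the whole row
        return self._rows[self._row_index(key)][1]


aln = DictLikeAlignment([("P3454", "ACGT-A"), ("Q9999", "ACGTTA")])
print(aln["P3454"])     # row looked up by identifier
print(aln["P3454", 4])  # single character, row given by identifier
print(aln[1, 4])        # single character, row given by index
```

Because the rows stay in a list, the sequence order is kept and unique identifiers are only needed by users who actually look rows up by name.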

> One should think carefully about the design of the Alignment class since 
> it will influence all further steps. As the class is now in its infancy, 
> this is a very good moment for thinking about what the Alignment class is 
> for and what it should support.

I had viewed the new __getitem__ method as a backwards-compatible 
enhancement of the existing stable (but rather limited) 
Bio.Align.Generic.Alignment class. That's not to say we can't design a new 
class from scratch - I just prefer gradual improvements without breaking 
existing usage.

I am particularly keen to allow slicing of alignments. For example, you 
could select the conserved core of an alignment by removing the 
leftmost 10 columns and the rightmost 10 columns:

align_core = aln[:,10:-10]
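
To make the intended behaviour concrete, here is a rough sketch with a toy class (SliceableAlignment and its sequences() helper are hypothetical names, not part of Biopython). It only handles the aln[rows, columns] form where both parts are slices, and trims two columns rather than ten to keep the example short:

```python
class SliceableAlignment(object):
    """Toy alignment: an ordered list of (identifier, sequence string) rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, key):
        row_key, col_key = key
        if isinstance(row_key, slice) and isinstance(col_key, slice):
            # Select the rows first, then cut the same columns out of each row.
            return SliceableAlignment(
                [(ident, seq[col_key]) for ident, seq in self._rows[row_key]]
            )
        raise TypeError("this sketch only handles aln[rows, cols] with two slices")

    def sequences(self):
        return [seq for _, seq in self._rows]


aln = SliceableAlignment([("a", "--ACGTACGT--"), ("b", "TTACGAACGTGG")])
align_core = aln[:, 2:-2]  # drop the two leftmost and two rightmost columns
print(align_core.sequences())
```

The result is a new alignment object, so the original is left untouched.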

> For instance, the Alignment object should
> support changing characters in the alignment without needing to copy 
> it (using aln[a,x] = "D"). Can this be done now with an Alignment which is 
> a list of SeqRecord objects with sequences implemented as immutable Seq 
> objects?

No, right now you can't easily edit sequences in a 
Bio.Align.Generic.Alignment (even with the proposed change) as it is 
implemented using immutable Seq objects. I personally haven't needed to 
edit an alignment like this.  Is this something you want to do often?

To me the obvious way to handle this is to have a MutableAlignment 
sub-class, where editing individual elements with aln[r,c] = "D" would 
be supported (possibly implemented using the MutableSeq class internally 
rather than the immutable Seq class).
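
As a rough illustration of that idea, the sketch below uses a toy class with each row held as a plain Python list of characters. The list is only a stand-in for MutableSeq, and the name ToyMutableAlignment and its row_as_string() helper are made up for this example:

```python
class ToyMutableAlignment(object):
    """Toy editable alignment; rows are stored as lists of single characters."""

    def __init__(self, rows):
        self._ids = [ident for ident, _ in rows]
        # A list of characters can be edited in place, unlike a string.
        self._seqs = [list(seq) for _, seq in rows]

    def __getitem__(self, key):
        row, col = key
        return self._seqs[row][col]

    def __setitem__(self, key, letter):
        row, col = key
        if len(letter) != 1:
            # Keep every row the same length by refusing multi-letter edits.
            raise ValueError("expected a single character")
        self._seqs[row][col] = letter

    def row_as_string(self, row):
        return "".join(self._seqs[row])


aln = ToyMutableAlignment([("a", "ACGT"), ("b", "AC-T")])
aln[1, 2] = "D"  # edit in place, no copy of the alignment is made
print(aln.row_as_string(1))
```

Keeping this in a separate sub-class would leave the existing immutable behaviour alone for code that relies on it.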

On a related point, I was planning to raise the following suggestion in 
the future - adding alignments, like this:

combined_aln = aln1 + aln2

For example, if aln1 had 5 rows of length 10 and aln2 had 5 rows of 
length 15, then the result of aln1+aln2 would have 5 rows of length 25.

Alignment addition would only be defined for alignments with the same 
number of rows (perhaps also restricted to the same sequence type, and 
row weights?). The result would contain the same number of rows, where 
each sequence is the concatenation of the corresponding two rows in the 
input alignments. I'd suggest concatenating the record.id values (if 
different); however, one could argue that it would be better to insist 
that the user had made sure the two alignments had consistent identifiers.
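
A minimal sketch of the addition described above, again with a toy class rather than the real Alignment object. AddableAlignment and rows() are hypothetical names, the "+" separator used when joining differing identifiers is an arbitrary assumption, and the checks on sequence type and row weights are left out:

```python
class AddableAlignment(object):
    """Toy alignment: an ordered list of (identifier, sequence string) rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)  # number of rows

    def __add__(self, other):
        if len(self) != len(other):
            raise ValueError("alignments must have the same number of rows")
        combined = []
        for (id1, seq1), (id2, seq2) in zip(self._rows, other._rows):
            # Keep a shared identifier; otherwise join the two (assumed "+").
            ident = id1 if id1 == id2 else id1 + "+" + id2
            combined.append((ident, seq1 + seq2))
        return AddableAlignment(combined)

    def rows(self):
        return list(self._rows)


aln1 = AddableAlignment([("human", "ACGT"), ("mouse", "AC-T")])
aln2 = AddableAlignment([("human", "GGA"), ("mus", "G-A")])
combined_aln = aln1 + aln2
for ident, seq in combined_aln.rows():
    print("%s %s" % (ident, seq))
```

Rows are paired purely by position here, which is why the two alignments need to be sorted into the same order before adding them.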

An example of where this could be used is taking alignments of multiple 
sets of homologous genes, sorting them to use the same species order, 
and then creating a concatenated alignment for robust phylogenetic tree 
construction.

Peter