[Biopython-dev] Unified alignment input/output, Bio.AlignIO?

Fri Apr 27 18:29:13 UTC 2007

Following the release of Biopython 1.43 with Bio.SeqIO, I would like to 
do a better job for multiple sequence alignment file formats by 
creating a new module, Bio.AlignIO.

While most multiple sequence alignment files contain a single 
alignment (made up of multiple sequences), this is not the general case.

In the PHYLIP suite, concatenated alignments in phylip format are 
produced by the seqboot program for tasks like bootstrapping of a 
phylogenetic tree.  Currently SeqIO chokes on these!

Another example is the output of some of the EMBOSS programs, which can 
contain many alignments; for example, the water and needle tools can 
produce many pairwise alignments.

In such cases, being able to write code like the following seems to be 
the logical extension of the Bio.SeqIO style we have agreed on:

from Bio import AlignIO
for alignment in AlignIO.parse("many.phy", "phylip") :
     print "Alignment with %i sequences of length %i" \
         % (len(alignment.get_all_seqs()),
            alignment.get_alignment_length())
     ...

i.e. the AlignIO.parse() function would be an iterator returning 
alignment objects, one per alignment in the file. Does this sound 
reasonable so far?
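
To make the iterator idea concrete, here is a rough sketch of a generator
that walks a file of concatenated alignments and yields them one at a
time. It is only an illustration: the function name iter_simple_phylip is
hypothetical, the input is a deliberately simplified PHYLIP-like layout
(a "count length" header followed by one "name sequence" line per
sequence), and plain lists of (name, sequence) tuples stand in for real
alignment objects.

```python
from io import StringIO


def iter_simple_phylip(handle):
    """Yield one alignment at a time from concatenated, simplified PHYLIP.

    Assumption for this sketch: each alignment is a header line
    "<count> <length>" followed by exactly <count> lines of
    "<name> <sequence>" (non-interleaved, name and sequence separated by
    whitespace). Real PHYLIP files are looser than this.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between alignments
        count, length = [int(x) for x in header.split()]
        records = []
        for _ in range(count):
            name, seq = handle.readline().split(None, 1)
            seq = seq.replace(" ", "").strip()
            if len(seq) != length:
                raise ValueError("Sequence %s has length %i, expected %i"
                                 % (name, len(seq), length))
            records.append((name, seq))
        yield records


# Two toy alignments concatenated in one "file"
data = StringIO("2 4\nalpha ACGT\nbeta  AC-T\n2 3\nalpha ACG\nbeta  A-G\n")
for records in iter_simple_phylip(data):
    print("Alignment with %i sequences of length %i"
          % (len(records), len(records[0][1])))
```

Because the generator reads only one alignment's worth of lines before
yielding, a large bootstrap file never has to be held in memory all at
once.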

As part of this work, I would also like to introduce a "RichAlignment" 
or "AnnotatedAlignment" as a subclass of the generic Alignment class 
which would be able to hold all the alignment annotation found in 
PFAM/Stockholm files, such as alignment ID, description, comments etc., 
plus all the per-column annotation.
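
As a rough illustration of what such a subclass might hold (nothing here
is a settled design), the sketch below uses a minimal stand-in for the
generic Alignment class so that it runs on its own. The attribute names
(id, description, comments, column_annotation) and the method
set_column_annotation are assumptions made up for this example.

```python
class Alignment(object):
    """Minimal stand-in for the generic alignment class (hypothetical)."""

    def __init__(self):
        self._records = []  # list of (name, sequence) tuples

    def add_sequence(self, name, sequence):
        self._records.append((name, sequence))

    def get_alignment_length(self):
        # Number of columns, taken from the first sequence (0 if empty)
        return len(self._records[0][1]) if self._records else 0


class AnnotatedAlignment(Alignment):
    """Alignment plus alignment-level and per-column annotation."""

    def __init__(self, identifier=None, description=None):
        Alignment.__init__(self)
        self.id = identifier
        self.description = description
        self.comments = []
        # feature name -> string with one character per alignment column
        self.column_annotation = {}

    def set_column_annotation(self, feature, text):
        # Per-column annotation only makes sense if it spans every column
        if len(text) != self.get_alignment_length():
            raise ValueError("Annotation %r has %i characters, expected %i"
                             % (feature, len(text),
                                self.get_alignment_length()))
        self.column_annotation[feature] = text


aln = AnnotatedAlignment(identifier="EXAMPLE", description="toy alignment")
aln.add_sequence("alpha", "ACGT")
aln.add_sequence("beta", "AC-T")
aln.comments.append("made-up data for illustration")
aln.set_column_annotation("SS_cons", "<..>")
```

Checking the annotation length at assignment time keeps the per-column
strings in step with the alignment columns.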

Assuming the existence of this AnnotatedAlignment class, the existing 
PFAM/Stockholm parser in Bio.SeqIO would be turned into a Bio.AlignIO 
parser to take advantage of the rich annotation.

Note - I still intend to allow Bio.SeqIO to be used on multiple 
sequence alignment files; however, the implementation may well do this 
internally via Bio.AlignIO.

---------------------------------------------------------------------

This also raises the related (but separate) issue of improving the 
generic Alignment object, raised in bug 1944:

http://bugzilla.open-bio.org/show_bug.cgi?id=1944

I personally would prefer the alignment class to act more like an 
array/matrix of residues/characters.

I would also like to be able to slice an alignment (either by 
columns or by sequence numbers) to get a sub-alignment.  The 
suggested AnnotatedAlignment class would also have to take care of 
slicing any per-sequence or per-column annotation.
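
One way the matrix-like behaviour could look is sketched below. This is
an assumption-laden toy, not a proposed implementation: the class name
SliceableAlignment, its attributes, and the aln[rows, columns] syntax are
all invented for this example, and sequence names stand in for
per-sequence annotation.

```python
class SliceableAlignment(object):
    """Toy alignment supporting aln[row_slice, column_slice] (hypothetical)."""

    def __init__(self, names, rows, column_annotation=None):
        self.names = list(names)  # one name per sequence
        self.rows = list(rows)    # one equal-length string per sequence
        # feature name -> string with one character per column
        self.column_annotation = dict(column_annotation or {})

    def __getitem__(self, key):
        if not isinstance(key, tuple) or len(key) != 2:
            raise TypeError("Use aln[row_slice, column_slice]")
        row_key, col_key = key
        if not (isinstance(row_key, slice) and isinstance(col_key, slice)):
            raise TypeError("This sketch only supports slices")
        return SliceableAlignment(
            self.names[row_key],  # per-sequence data follows the rows
            [row[col_key] for row in self.rows[row_key]],
            # per-column annotation is cut with the same column slice
            dict((feature, text[col_key])
                 for feature, text in self.column_annotation.items()),
        )


aln = SliceableAlignment(["alpha", "beta", "gamma"],
                         ["ACGTA", "AC-TA", "A-GTA"],
                         {"SS_cons": "<<.>>"})
sub = aln[0:2, 1:4]  # first two sequences, columns 1 to 3
print(sub.names)
print(sub.rows)
print(sub.column_annotation)
```

The point of the sketch is that one slicing operation cuts the residues,
the per-sequence data and the per-column annotation together, so the
three cannot drift out of step.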

Peter