[Biopython-dev] Unified alignment input/output, Bio.AlignIO?
Peter
biopython-dev at maubp.freeserve.co.uk
Fri Apr 27 18:29:13 UTC 2007
Following the release of Biopython 1.43 with Bio.SeqIO, I would like to
do a better job with multiple sequence alignment file formats by
creating a new module, Bio.AlignIO.
While most multiple sequence alignment files contain a single
alignment (made up of multiple sequences), this is not the general case.
In the PHYLIP suite, concatenated alignments in phylip format are
produced by the seqboot program for tasks like bootstrapping of a
phylogenetic tree. Currently SeqIO chokes on these!
Another example is the output of some of the EMBOSS programs, which can
contain many multiple sequence alignments. For example, the water and
needle tools can produce many pairwise alignments.
In such cases, being able to write code like the following seems to be
the logical extension of the Bio.SeqIO style we have agreed on:
from Bio import AlignIO
for alignment in AlignIO.parse("many.phy", "phylip"):
    print("Alignment with %i sequences of length %i"
          % (len(alignment.get_all_seqs()),
             alignment.get_alignment_length()))
    ...
That is, the AlignIO.parse() function would return an iterator yielding
alignment objects. Does this sound reasonable so far?
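To make the iterator idea concrete, here is a minimal sketch in plain
Python of how concatenated PHYLIP alignments could be read one at a
time. It is not the proposed Bio.AlignIO implementation: the function
name parse_phylip_alignments is hypothetical, it yields plain
(names, sequences) tuples rather than alignment objects, and it assumes
strict PHYLIP layout (a 10-character name field, with interleaved blocks
repeating the sequences in the same order) with no error handling for
truncated files.

```python
from io import StringIO


def parse_phylip_alignments(handle):
    """Yield (names, sequences) for each alignment in a concatenated PHYLIP file.

    Hypothetical helper for illustration only; assumes strict PHYLIP with a
    10-character name field.
    """
    lines = (line.rstrip("\r\n") for line in handle)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between alignments
        fields = line.split()
        n_seqs, n_cols = int(fields[0]), int(fields[1])
        names, seqs = [], []
        # First block: each row is a 10-character name followed by sequence.
        for _ in range(n_seqs):
            row = next(lines)
            names.append(row[:10].strip())
            seqs.append(row[10:].replace(" ", ""))
        # Any further interleaved blocks carry sequence data only.
        while len(seqs[-1]) < n_cols:
            filled = 0
            while filled < n_seqs:
                row = next(lines)
                if row.strip():
                    seqs[filled] += row.replace(" ", "")
                    filled += 1
        yield names, seqs


# Two small alignments concatenated in one "file", as seqboot would produce.
example = "\n".join([
    " 2 6",
    "Alpha".ljust(10) + "ACGTAC",
    "Beta".ljust(10) + "ACG-AC",
    " 2 4",
    "Alpha".ljust(10) + "ACGT",
    "Beta".ljust(10) + "AC-T",
]) + "\n"

for names, seqs in parse_phylip_alignments(StringIO(example)):
    print("Alignment with %i sequences of length %i" % (len(seqs), len(seqs[0])))
```

Because the function is a generator, each alignment is only read when
the loop asks for it, so a file with thousands of bootstrap replicates
need not be held in memory at once.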
As part of this work, I would also like to introduce a "RichAlignment"
or "AnnotatedAlignment" as a subclass of the generic Alignment class
which would be able to hold all the alignment annotation found in
PFAM/Stockholm files, such as alignment ID, description, comments, etc.,
plus all the per column annotation.
Assuming the existence of this AnnotatedAlignment class, the existing
PFAM/Stockholm parser in Bio.SeqIO would be turned into a Bio.AlignIO
parser to take advantage of the rich annotation.
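As a rough illustration of the idea (not a proposed implementation),
such a subclass might look like the sketch below. The small Alignment
base class here is a hypothetical stand-in for the generic Alignment
class, and the attribute names annotations and column_annotations are
assumptions chosen for the example. The keys used ("ID", "DE",
"SS_cons") follow the Stockholm #=GF and #=GC feature names.

```python
class Alignment:
    """Hypothetical stand-in for the generic alignment class."""

    def __init__(self, names, sequences):
        if len(names) != len(sequences):
            raise ValueError("need exactly one name per sequence")
        if len(set(len(seq) for seq in sequences)) > 1:
            raise ValueError("all sequences must have the same length")
        self.names = list(names)
        self.sequences = list(sequences)

    def get_alignment_length(self):
        return len(self.sequences[0]) if self.sequences else 0


class AnnotatedAlignment(Alignment):
    """Alignment plus whole-alignment and per-column annotation (sketch)."""

    def __init__(self, names, sequences, annotations=None,
                 column_annotations=None):
        super().__init__(names, sequences)
        # Whole-alignment annotation, e.g. ID, description, comments.
        self.annotations = dict(annotations or {})
        # Per-column annotation: one character per alignment column.
        self.column_annotations = {}
        for key, value in (column_annotations or {}).items():
            self.add_column_annotation(key, value)

    def add_column_annotation(self, key, value):
        if len(value) != self.get_alignment_length():
            raise ValueError("column annotation must match alignment length")
        self.column_annotations[key] = value


aln = AnnotatedAlignment(
    ["seq1", "seq2"],
    ["ACG-U", "ACGGU"],
    annotations={"ID": "example", "DE": "toy alignment"},
    column_annotations={"SS_cons": "<<.>>"},
)
```

Checking the length of each per-column annotation on the way in keeps
the annotation and the columns in step, which matters as soon as
sub-alignments are taken.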
Note: I still intend to allow Bio.SeqIO to be used on multiple sequence
alignment files; however, the implementation may well do this
internally via Bio.AlignIO.
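One way this delegation could work is for the sequence-level parser to
flatten the alignment iterator into a single stream of records. The
sketch below shows only that flattening step; the function name
records_from_alignments is hypothetical, and plain (name, sequence)
tuples stand in for real record objects.

```python
def records_from_alignments(alignments):
    """Yield every record from every alignment, in file order (sketch)."""
    for alignment in alignments:
        for record in alignment:
            yield record


# Two alignments, each represented here as a simple list of records.
alignments = [
    [("Alpha", "ACGTAC"), ("Beta", "ACG-AC")],
    [("Alpha", "ACGT"), ("Beta", "AC-T")],
]
flat = list(records_from_alignments(alignments))
```

The obvious cost is that the boundaries between alignments are lost,
which is exactly why a separate alignment-level interface is useful.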
---------------------------------------------------------------------
This also raises the related (but separate) issue of improving the
generic Alignment object, raised in bug 1944:
http://bugzilla.open-bio.org/show_bug.cgi?id=1944
I personally would prefer the alignment class to act more like an
array/matrix of residues/characters.
I would also like to be able to slice an alignment (either by columns
or by sequence number) to get a sub-alignment. The suggested
AnnotatedAlignment class would also have to take care of slicing any
per sequence or per column annotation.
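A minimal sketch of this matrix-like behaviour, assuming a
two-dimensional alignment[rows, columns] indexing convention, is given
below. The class name SliceableAlignment and its attributes are
hypothetical and are not taken from bug 1944 or from any existing
Biopython class. Sequence names stand in for per sequence annotation and
are cut along the row axis, while per column annotation is cut along the
column axis.

```python
class SliceableAlignment:
    """Toy alignment supporting alignment[rows, columns] with two slices."""

    def __init__(self, names, rows, column_annotations=None):
        self.names = list(names)          # per sequence annotation
        self.rows = list(rows)            # one string per sequence
        self.column_annotations = dict(column_annotations or {})

    def __getitem__(self, index):
        row_slice, col_slice = index
        if not (isinstance(row_slice, slice) and isinstance(col_slice, slice)):
            raise TypeError("this sketch only supports two slices")
        return SliceableAlignment(
            self.names[row_slice],
            [row[col_slice] for row in self.rows[row_slice]],
            # Per column annotation follows the column slice.
            {key: value[col_slice]
             for key, value in self.column_annotations.items()},
        )


aln = SliceableAlignment(
    ["a", "b", "c"],
    ["ACGT-A", "ACG--A", "TCGTTA"],
    {"mask": "110011"},
)
sub = aln[0:2, 1:4]   # first two sequences, columns 1 to 3
cols = aln[:, 2:]     # all sequences, from column 2 onwards
```

Returning a new object of the same class means a sub-alignment can be
sliced again, and the annotation stays consistent with the columns that
remain.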
Peter