[Biopython-dev] Unified alignment input/output, Bio.AlignIO?
Peter
biopython-dev at maubp.freeserve.co.uk
Mon May 7 14:16:38 UTC 2007
Peter wrote:
> Following the release of Biopython 1.43 with Bio.SeqIO, I would like to
> do a better job for multiple sequence alignment file formats - creating
> a new module Bio.AlignIO
>
> While most multiple sequence alignment files contain a single
> alignment (made up of multiple sequences), this is not the general case.
>
> In the PHYLIP suite, concatenated alignments in phylip format are
> produced by the seqboot program for tasks like bootstrapping of a
> phylogenetic tree. Currently SeqIO chokes on these!
>
> Another example is the output of some of the EMBOSS programs, which can
> contain many multiple sequence alignments; for example, the water and
> needle tools can produce many pairwise alignments.
>
> In such cases, being able to write code like the following seems to be
> the logical extension of the Bio.SeqIO style we have agreed on:
>
> from Bio import AlignIO
> for alignment in AlignIO.parse("many.phy", "phylip") :
>     print "Alignment with %i sequences of length %i" \
>           % (len(alignment.get_all_seqs()),
>              alignment.get_alignment_length())
>     ...
>
> i.e. the AlignIO.parse() function would be an iterator returning
> alignment objects. Does this sound reasonable so far?
I have pressed ahead with this; there is a version attached to bug 2285:
http://bugzilla.open-bio.org/show_bug.cgi?id=2285
This handles reading and writing of clustal, phylip and stockholm/pfam. I
have not yet converted the Bio.SeqIO Nexus parser. Also, I plan to add a
parser for reading the EMBOSS alignment format.
As a side effect, this will actually remove a lot of the Bio.SeqIO code
as handling any alignment file can be delegated to Bio.AlignIO instead.
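For illustration only, the delegation idea might be sketched as below. This
is not the code attached to bug 2285 and it does not use Biopython at all:
the names parse_alignments and parse_records are hypothetical, an alignment
is represented as a plain list of (name, sequence) tuples, and the parser
assumes strict sequential PHYLIP with a 10-character name field and each
sequence on a single line. It is written for Python 3.

```python
import io


def parse_alignments(handle):
    """Yield one alignment at a time from concatenated sequential PHYLIP.

    Hypothetical helper: each alignment is a list of (name, sequence)
    tuples, and every sequence is assumed to fit on a single line.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file, no more alignments
        if not header.strip():
            continue  # tolerate blank lines between alignments
        # Header line holds the number of sequences and the alignment length.
        count, length = (int(field) for field in header.split())
        alignment = []
        for _ in range(count):
            line = handle.readline().rstrip("\n")
            name = line[:10].strip()          # fixed-width name field
            seq = line[10:].replace(" ", "")  # drop any spacing in the sequence
            if len(seq) != length:
                raise ValueError("%s has length %i, expected %i"
                                 % (name, len(seq), length))
            alignment.append((name, seq))
        yield alignment


def parse_records(handle):
    """Yield every (name, sequence) pair by delegating to parse_alignments."""
    for alignment in parse_alignments(handle):
        for record in alignment:
            yield record


# Two small made-up alignments concatenated in one file, as seqboot would do.
EXAMPLE = (
    " 2 6\n"
    + "Alpha".ljust(10) + "ACGTAC\n"
    + "Beta".ljust(10) + "ACG-AC\n"
    + " 2 4\n"
    + "Alpha".ljust(10) + "ACGT\n"
    + "Beta".ljust(10) + "AC-T\n"
)

for alignment in parse_alignments(io.StringIO(EXAMPLE)):
    print("Alignment with %i sequences of length %i"
          % (len(alignment), len(alignment[0][1])))
```

The point of the sketch is that the record-level iterator contains no file
format logic of its own. It only flattens whatever the alignment-level
iterator yields, which is why moving the parsers into one place could shrink
the sequence-level code.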
Would anyone like to comment on the scheme?
Peter