[Biopython-dev] Unified alignment input/output, Bio.AlignIO?

Peter biopython-dev at maubp.freeserve.co.uk
Mon May 7 14:16:38 UTC 2007


Peter wrote:
> Following the release of Biopython 1.43 with Bio.SeqIO, I would like to 
> do a better job for multiple sequence alignment file formats - creating 
> a new module Bio.AlignIO
> 
> While most multiple sequence alignment files contain a single 
> alignment (made up of multiple sequences), this is not the general case.
> 
> In the PHYLIP suite, concatenated alignments in phylip format are 
> produced by the seqboot program for tasks like bootstrapping of a 
> phylogenetic tree.  Currently SeqIO chokes on these!
> 
> Another example is the output of some of the EMBOSS programs, which can 
> contain many multiple sequence alignments; for example, the water and 
> needle tools can produce many pairwise alignments.
> 
> In such cases, being able to write code like the following seems to be 
> the logical extension of the Bio.SeqIO style we have agreed on:
> 
> from Bio import AlignIO
> for alignment in AlignIO.parse("many.phy", "phylip") :
>      print "Alignment with %i sequences of length %i" \
>          % (len(alignment.get_all_seqs()),
>             alignment.get_alignment_length())
>      ...
> 
> i.e. the AlignIO.parse() function would be an iterator returning 
> alignment objects. Does this sound reasonable so far?
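
As a rough illustration of the iterator idea quoted above, here is a minimal pure-Python sketch. It is not the code attached to the bug and does not use Biopython: the function name parse_concatenated_phylip is hypothetical, alignments are represented as plain lists of (name, sequence) tuples rather than alignment objects, and it assumes the simplest sequential PHYLIP layout (one line per sequence, name in the first ten columns).

```python
from io import StringIO


def parse_concatenated_phylip(handle):
    """Yield one alignment at a time as a list of (name, sequence) tuples.

    Assumption: each alignment starts with a header line holding two
    integers (number of sequences, alignment length), followed by exactly
    one line per sequence with the name in the first 10 columns.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file, no more alignments
        if not header.strip():
            continue  # tolerate blank lines between alignments
        count, length = (int(field) for field in header.split())
        records = []
        for _ in range(count):
            line = handle.readline().rstrip("\n")
            name = line[:10].strip()
            seq = line[10:].replace(" ", "")
            if len(seq) != length:
                raise ValueError("sequence %r has length %i, expected %i"
                                 % (name, len(seq), length))
            records.append((name, seq))
        yield records


# Two alignments concatenated in one file, as seqboot might produce.
example = StringIO(
    " 2 8\n"
    "alpha     ACGTACGT\n"
    "beta      ACGTAC-T\n"
    " 3 4\n"
    "alpha     ACGT\n"
    "beta      AC-T\n"
    "gamma     ACGA\n"
)

summaries = [(len(alignment), len(alignment[0][1]))
             for alignment in parse_concatenated_phylip(example)]
for n_seqs, n_cols in summaries:
    print("Alignment with %i sequences of length %i" % (n_seqs, n_cols))
```

Because the function is a generator, only one alignment is held in memory at a time, which matters for files with many bootstrap replicates.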

I have pressed ahead with this; there is a version attached to bug 2285:

http://bugzilla.open-bio.org/show_bug.cgi?id=2285

This handles reading and writing of the clustal, phylip and 
stockholm/pfam formats. I have not yet converted the Bio.SeqIO Nexus 
parser. I also plan to add a parser for reading the EMBOSS alignment 
format.

As a side effect, this will actually remove a lot of the Bio.SeqIO code, 
as handling any alignment file can be delegated to Bio.AlignIO instead.
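
One way to picture that delegation is the sketch below. It is only an assumption about the general shape, not the actual Bio.SeqIO code: the helper name records_from_alignments is hypothetical, and plain lists of (name, sequence) tuples stand in for real alignment and record objects.

```python
def records_from_alignments(alignments):
    """Flatten an iterator of alignments into a single stream of records.

    A sequence-level parser could call an alignment-level parser and then
    simply hand back the rows of each alignment, one after another.
    """
    for alignment in alignments:
        for record in alignment:
            yield record


# Stand-in data: each alignment is a list of (name, sequence) tuples.
alignments = [
    [("alpha", "ACGT"), ("beta", "AC-T")],
    [("alpha", "GGCC"), ("beta", "GGC-")],
]

names = [name for name, _ in records_from_alignments(iter(alignments))]
print(names)
```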

Would anyone like to comment on the scheme?

Peter



