[Biopython-dev] MAF Parser/Indexer

Thu Mar 29 11:52:59 EDT 2012

Hi all,

I would like to start a discussion about what is needed to make the 
AlignIO.MafIO parser and indexer ready for the next release. For anyone 
unfamiliar with MAF (Multiple Alignment Format), it is the file format 
used to store the eukaryote genome-to-genome multiple alignments 
produced by multiz.

The exact specs are here:
   http://genome.ucsc.edu/FAQ/FAQformat.html#format5
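
As a rough illustration of the format (a minimal sketch, not the MafIO 
parser itself): each alignment block starts with an "a" line, followed 
by "s" lines whose fields are src, start, size, strand, srcSize and the 
aligned text. The names MafRow and parse_maf_blocks, and the sample 
records, are made up for this example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class MafRow(NamedTuple):
    """One 's' line of a MAF block (hypothetical container for this sketch)."""
    src: str        # e.g. "genome.chromosome"
    start: int      # 0-based start in the source sequence
    size: int       # number of non-gap characters in text
    strand: str     # "+" or "-"
    src_size: int   # total length of the source sequence
    text: str       # aligned sequence, "-" marks a gap


def parse_maf_blocks(lines: Iterable[str]) -> Iterator[List[MafRow]]:
    """Yield the 's' rows of each alignment block; other line types are skipped."""
    block = None
    for raw in lines:
        line = raw.strip()
        if line == "a" or line.startswith("a "):
            if block:
                yield block
            block = []
        elif line.startswith("s ") and block is not None:
            _, src, start, size, strand, src_size, text = line.split()
            block.append(
                MafRow(src, int(start), int(size), strand, int(src_size), text)
            )
        elif not line and block:
            # A blank line terminates the current block.
            yield block
            block = None
    if block:
        yield block


# Made-up sample data, not taken from any real alignment.
SAMPLE = """##maf version=1
a score=10.0
s ref.chr1 100 6 + 1000 ACG-TTA
s qry.chr2 50 7 - 2000 ACGGTTA

"""

blocks = list(parse_maf_blocks(SAMPLE.splitlines()))
```

The sketch ignores "i", "e" and "q" lines and does no validation beyond 
field count.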

Some use cases are discussed in this paper, which (I believe) implements 
most of the same functionality as the MafIO class, but in Galaxy:
   http://www.ncbi.nlm.nih.gov/pubmed/21775304

The branch of my biopython fork that contains the class:
   https://github.com/polyatail/biopython/tree/alignio-maf

The class is implemented as a reader/writer compatible with the AlignIO 
API, but implements its own indexer (MafIO.MafIndex) based on 
SeqIO.index_db(). At the time, this seemed like the best way to 
implement this, as MAF is explicitly designed for genome-to-genome 
alignments while other formats are not. If we can assume a MAF file 
contains such an alignment, we can index it by genome coordinates and 
allow random access to intervals.
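
To make the idea concrete, here is a minimal sketch of coordinate-based 
lookup using an in-memory SQLite table. This is not the MafIndex 
implementation: the schema, the column names and the functions 
build_index and find_overlapping are assumptions for illustration, and 
the offsets are made-up numbers standing in for file positions of 
alignment blocks.

```python
import sqlite3


def build_index(records):
    """Store (ref_start, ref_end, file_pos) per block.

    Coordinates are assumed to be 0-based, half-open, on the reference genome.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE blocks (ref_start INTEGER, ref_end INTEGER, file_pos INTEGER)"
    )
    con.executemany("INSERT INTO blocks VALUES (?, ?, ?)", records)
    con.execute("CREATE INDEX blocks_by_start ON blocks (ref_start)")
    con.commit()
    return con


def find_overlapping(con, start, end):
    """Return file positions of blocks overlapping [start, end), in genome order."""
    rows = con.execute(
        "SELECT file_pos FROM blocks "
        "WHERE ref_start < ? AND ref_end > ? ORDER BY ref_start",
        (end, start),
    )
    return [row[0] for row in rows]


# Three hypothetical blocks covering 0-100, 100-250 and 400-500.
con = build_index([(0, 100, 10), (100, 250, 900), (400, 500, 2100)])
hits = find_overlapping(con, 90, 120)
```

With the file positions in hand, a reader could seek directly to each 
block instead of scanning the whole file.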

This is especially useful since it is often desirable to retrieve the 
spliced multiple alignment of a multi-exonic transcript, which can be 
used to determine sequence conservation, construct a phylogenetic tree 
for a particular gene, or pull out orthologs of a large number of genes 
at once.
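
A minimal sketch of what "spliced" means here, in plain Python rather 
than the MafIO code: for each exon, keep only the alignment columns that 
fall within the exon's reference coordinates, then concatenate the 
pieces per species. The functions slice_block and splice and the toy 
data are hypothetical. The sketch assumes forward-strand, 0-based 
half-open coordinates, blocks sorted by reference start, a reference row 
in every block, and exons fully covered by blocks.

```python
def slice_block(ref_start, rows, ref_name, start, end):
    """Cut one block down to the columns covering reference [start, end)."""
    pos = ref_start  # reference coordinate of the next non-gap reference base
    cols = []
    for i, ch in enumerate(rows[ref_name]):
        if ch == "-":
            # Insertion relative to the reference: keep it only when the
            # reference bases on both sides lie inside the interval.
            if start < pos < end:
                cols.append(i)
            continue
        if start <= pos < end:
            cols.append(i)
        pos += 1
    return {name: "".join(text[i] for i in cols) for name, text in rows.items()}


def splice(blocks, ref_name, exons):
    """Concatenate the exon slices; species absent from a block are gap-filled."""
    species = sorted({name for _, rows in blocks for name in rows})
    parts = {name: [] for name in species}
    for start, end in exons:
        for ref_start, rows in blocks:
            piece = slice_block(ref_start, rows, ref_name, start, end)
            width = len(piece[ref_name])
            if width == 0:
                continue  # this block does not touch the exon
            for name in species:
                parts[name].append(piece.get(name, "-" * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy blocks: (reference start, {species: aligned text}).
blocks = [
    (100, {"ref": "ACG-TTA", "qry": "ACGGTTA"}),
    (200, {"ref": "GGCC", "other": "GACC"}),
]
spliced = splice(blocks, "ref", exons=[(101, 104), (200, 202)])
```

Reverse-strand transcripts would additionally need the result 
reverse-complemented, which this sketch leaves out.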

The code consists of the reader, writer, and indexer classes in 
AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to 
the indexer in Tests/test_MafIO_index.py. I would really appreciate any 
feedback and suggestions, and if anyone has a chance to try this 
feature, it would be great to hear how it works in practice.

Thanks!
Andrew