[Biopython-dev] MAF Parser/Indexer
Andrew Sczesnak
andrew.sczesnak at med.nyu.edu
Thu Mar 29 15:52:59 UTC 2012
Hi all,
I would like to start a discussion about what is needed to make the
AlignIO.MafIO parser and indexer ready for the next release. For anyone
unfamiliar with MAF (Multiple Alignment Format), it is the file format
used to store the eukaryote genome-to-genome multiple alignments produced
by multiz.
The exact specs are here:
http://genome.ucsc.edu/FAQ/FAQformat.html#format5
Some use cases are discussed in this paper, which implements in Galaxy
(I believe) most of the same functionality as the MafIO class:
http://www.ncbi.nlm.nih.gov/pubmed/21775304
The branch of my biopython fork that contains the class:
https://github.com/polyatail/biopython/tree/alignio-maf
The class is implemented as a reader/writer compatible with the AlignIO
API, but implements its own indexer (MafIO.MafIndex) based on
SeqIO.index_db(). At the time, this seemed like the best way to
implement this, as MAF is explicitly designed for genome-to-genome
alignments while other formats are not. If we can assume a MAF file
contains such an alignment, we can index it by genome coordinates and
allow random access to intervals.
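To illustrate the idea of indexing by genome coordinates, here is a
minimal pure-Python sketch. It is not the MafIndex implementation (which
is based on SeqIO.index_db()); the names build_index, search and
read_block, and the toy alignment, are made up for this example. It
assumes the target sequence is on the "+" strand in every block and uses
the zero-based start and ungapped size fields of MAF "s" lines.

```python
import io

# Toy MAF data, invented for this example.
MAF_TEXT = """##maf version=1
a score=10.0
s hg19.chr1 100 10 + 1000 ACGTACGTAC
s mm9.chr4  500 10 + 2000 ACGTTCGTAC

a score=20.0
s hg19.chr1 200 5 + 1000 GGGCC
s mm9.chr4  700 5 + 2000 GGACC
"""


def build_index(handle, target):
    """Return (start, end, offset) for every block containing `target`.

    Coordinates are zero-based, half-open; offset is the file position
    of the block's "a" line.
    """
    index = []
    block_offset = None
    while True:
        offset = handle.tell()  # position before reading the line
        line = handle.readline()
        if not line:
            break
        fields = line.split()
        if not fields:
            continue  # blank line between blocks
        if fields[0] == "a":
            block_offset = offset
        elif fields[0] == "s" and fields[1] == target and block_offset is not None:
            start, size = int(fields[2]), int(fields[3])
            index.append((start, start + size, block_offset))
    return index


def search(index, start, end):
    """Offsets of blocks overlapping [start, end) on the target."""
    return [off for (s, e, off) in index if s < end and e > start]


def read_block(handle, offset):
    """Seek to a block and return {source name: aligned text}."""
    handle.seek(offset)
    handle.readline()  # skip the "a" line
    rows = {}
    while True:
        fields = handle.readline().split()
        if not fields:  # blank line or end of file ends the block
            break
        if fields[0] == "s":
            rows[fields[1]] = fields[6]
    return rows


handle = io.StringIO(MAF_TEXT)
index = build_index(handle, "hg19.chr1")
hits = search(index, 202, 204)
block = read_block(handle, hits[0])
```

A linear scan over the index is enough for a sketch; a real indexer
would keep the intervals in a database or a sorted structure so lookups
do not touch every block.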
This is especially useful since it is often desirable to retrieve the
spliced multiple alignment of a multi-exonic transcript, which can be
used to determine sequence conservation, construct a phylogenetic tree
for a particular gene, or pull out orthologs of a large number of genes
at once.
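As a rough sketch of what "spliced" means here: for each exon, cut the
alignment columns that cover the exon's reference coordinates out of
each block, then concatenate the pieces per species. The helper names
(slice_block, splice), the gap-filling behaviour and the toy data are
assumptions made for this example, not the API of the branch. It assumes
blocks are sorted, non-overlapping, and have the reference on the "+"
strand; exon positions not covered by any block are simply dropped.

```python
def slice_block(ref_start, rows, ref, start, end):
    """Keep the columns of one block covering reference [start, end)."""
    keep = []
    pos = ref_start  # coordinate of the next non-gap reference base
    for col, base in enumerate(rows[ref]):
        if base == "-":
            # Insertion relative to the reference: keep it only when it
            # lies strictly inside the requested interval.
            if start < pos < end:
                keep.append(col)
        else:
            if start <= pos < end:
                keep.append(col)
            pos += 1
    return {name: "".join(text[c] for c in keep) for name, text in rows.items()}


def splice(blocks, ref, exons, species):
    """Concatenate exon slices; pad species missing from a block with '-'."""
    out = {name: [] for name in species}
    for start, end in exons:
        for ref_start, rows in blocks:
            piece = slice_block(ref_start, rows, ref, start, end)
            width = len(piece[ref])
            for name in species:
                out[name].append(piece.get(name, "-" * width))
    return {name: "".join(parts) for name, parts in out.items()}


# Toy data: (reference start, {source name: aligned text}) per block.
blocks = [
    (100, {"hg19.chr1": "ACG-TACGTAC", "mm9.chr4": "ACGATTCGTAC"}),
    (200, {"hg19.chr1": "GGGCC", "panTro2.chr1": "GGACC"}),
]
exons = [(102, 105), (201, 204)]
species = ["hg19.chr1", "mm9.chr4", "panTro2.chr1"]
spliced = splice(blocks, "hg19.chr1", exons, species)
```

Every row of the result has the same length, so it can be handed to
downstream tools as an ordinary multiple alignment.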
The code consists of the reader, writer, and indexer classes in
AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to
the indexer in Tests/test_MafIO_index.py. I would really appreciate any
feedback and suggestions, especially from anyone who has an opportunity
to try this feature in practice.
Thanks!
Andrew