[Biopython-dev] MAF Parser/Writer/Indexer

Fri May 13 21:26:58 UTC 2011

Hi All,

I'd like to contribute MAF parser/writer classes to Bio.AlignIO.  MAF is 
an alignment format used for whole genome alignments, as in the 30-way 
(or more) multiz alignments at UCSC:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/

A description of the format is available here:

http://genome.ucsc.edu/FAQ/FAQformat#format5

The value of this format to most users will come from the ability to 
extract sequences from an arbitrary number of species that align to a 
particular sequence range in a particular genome, at random.  We should 
be able to say, report the alignment of 50 genomes to the human HOX 
locus fairly quickly (say <1s).  An iterator and writer class will 
certainly be useful, but to implement the aforementioned functionality, 
some API changes are probably necessary.

I think the most straightforward way of accomplishing this is to add an 
additional, searchable SQLite table to SeqIO's index_db().  The present 
table, offset_data, translates a unique sequence identifier to the file 
offset and is more suited to multifasta or other sequence files. 
Another table might store chromosome, start, and end positions to allow 
a set of alignment records falling within a particular sequence range on 
a chromosome to be extracted with an SQL query (obscured from the user). 
  This table would remain empty in formats where no search functionality 
is implemented.

Also necessary, a search() function on top of the index_db() UserDict, 
accessible as in:

from AlignIO.MafIO import MafIndexer

indexer = MafIndexer("mm9")

index = SeqIO.index_db (index_file, maf_file, "maf", \
                         key_function = MafIndexer.index)

for i in index.search ("chr5", 5000, 10000):
     print i

where the output is a series of MultipleSeqAlignment objects with 
sequences falling within the searched range.  When used with other 
formats, the function could perform a quick "key LIKE '%key%'" SQL query 
to retrieve multiple records with similar names.  As a note, the 
MafIndexer callback function above is necessary to choose which species 
in the alignment the index is generated for.

Some quick code implementing these additions loads the index of a 3.6GB 
MAF file in ~500ms and retrieves a 40kb alignment in about 1.6s, leaving 
some room for optimization.

Does anyone have any thoughts on how index_db() should be developed, and 
if these changes ought to be implemented in SeqIO or an AlignIO index 
API be created?

Thanks,

-- 
Andrew Sczesnak
Bioinformatician, Littman Lab
Howard Hughes Medical Institute
New York University School of Medicine
540 First Avenue
New York, NY 10016

p: (212) 263-6921
f: (212) 263-1498
e: andrew.sczesnak at med.nyu.edu