[Biopython-dev] MAF Parser/Writer/Indexer
Andrew Sczesnak
andrew.sczesnak at med.nyu.edu
Fri May 13 21:26:58 UTC 2011
Hi All,
I'd like to contribute MAF parser/writer classes to Bio.AlignIO. MAF is
an alignment format used for whole genome alignments, as in the 30-way
(or more) multiz alignments at UCSC:
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/
A description of the format is available here:
http://genome.ucsc.edu/FAQ/FAQformat#format5
The value of this format to most users will come from the ability to
extract sequences from an arbitrary number of species that align to a
particular sequence range in a particular genome, at random. We should
be able to say, report the alignment of 50 genomes to the human HOX
locus fairly quickly (say <1s). An iterator and writer class will
certainly be useful, but to implement the aforementioned functionality,
some API changes are probably necessary.
I think the most straightforward way of accomplishing this is to add an
additional, searchable SQLite table to SeqIO's index_db(). The present
table, offset_data, translates a unique sequence identifier to the file
offset and is more suited to multifasta or other sequence files.
Another table might store chromosome, start, and end positions to allow
a set of alignment records falling within a particular sequence range on
a chromosome to be extracted with an SQL query (obscured from the user).
This table would remain empty in formats where no search functionality
is implemented.
Also necessary, a search() function on top of the index_db() UserDict,
accessible as in:
from AlignIO.MafIO import MafIndexer
indexer = MafIndexer("mm9")
index = SeqIO.index_db (index_file, maf_file, "maf", \
key_function = MafIndexer.index)
for i in index.search ("chr5", 5000, 10000):
print i
where the output is a series of MultipleSeqAlignment objects with
sequences falling within the searched range. When used with other
formats, the function could perform a quick "key LIKE '%key%'" SQL query
to retrieve multiple records with similar names. As a note, the
MafIndexer callback function above is necessary to choose which species
in the alignment the index is generated for.
Some quick code implementing these additions loads the index of a 3.6GB
MAF file in ~500ms and retrieves a 40kb alignment in about 1.6s, leaving
some room for optimization.
Does anyone have any thoughts on how index_db() should be developed, and
if these changes ought to be implemented in SeqIO or an AlignIO index
API be created?
Thanks,
--
Andrew Sczesnak
Bioinformatician, Littman Lab
Howard Hughes Medical Institute
New York University School of Medicine
540 First Avenue
New York, NY 10016
p: (212) 263-6921
f: (212) 263-1498
e: andrew.sczesnak at med.nyu.edu
More information about the Biopython-dev
mailing list