[Biopython-dev] MAF Parser/Indexer
Brad Chapman
chapmanb at 50mail.com
Sun Apr 1 16:30:26 EDT 2012
Andrew;
Thanks for putting this together. It looks great, is well integrated
with AlignIO and it's awesome to see a test suite.
I dug through the code and my small suggestions would be:
- Could you refactor some of the larger functions into smaller, separate
components? A couple of these span a large number of lines, which makes
the logic difficult to follow:
https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172
https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399
As a practical example, here you have a large block which checks the
SQLite index matches the MAF file and everything looks okay:
https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199
This would be clearer if factored into something like:
    if os.path.isfile(sqlite_file):
        try:
            self._record_count = self._verify_record_count(con)
        except ...
- Would you be able to put together a small example for the
Cookbook or Tutorial documentation? This would be a great way to help
others get started with the functionality and advertise it.
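To make the refactoring suggestion a bit more concrete, here is a minimal,
self-contained sketch of what such a helper could look like. It is not taken
from the branch: the meta_data table layout, the IndexMismatchError class and
the function names are assumptions made for illustration, and the real check
would compare against whatever the indexer actually stores.

```python
import os
import sqlite3


class IndexMismatchError(ValueError):
    """Hypothetical error: the SQLite index does not match the MAF file."""


def verify_record_count(con, expected_count):
    """Return the record count stored in the index, or raise on mismatch.

    Assumes a hypothetical meta_data table holding (key, value) rows.
    """
    row = con.execute(
        "SELECT value FROM meta_data WHERE key = ?", ("record_count",)
    ).fetchone()
    if row is None:
        raise IndexMismatchError("index has no record_count entry")
    stored = int(row[0])
    if stored != expected_count:
        raise IndexMismatchError(
            "index lists %d records, MAF file has %d" % (stored, expected_count)
        )
    return stored


def load_record_count(sqlite_file, expected_count):
    """Short top-level check; the details live in verify_record_count."""
    if not os.path.isfile(sqlite_file):
        return None  # the caller would build a fresh index in this case
    con = sqlite3.connect(sqlite_file)
    try:
        return verify_record_count(con, expected_count)
    finally:
        con.close()
```

With each check in its own small function, the calling code reads as a
sequence of named steps, and each step can be unit tested on its own.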
Thanks again for this,
Brad
> Hi all,
>
> I would like to start a discussion about what is needed to make the
> AlignIO.MafIO parser and indexer ready for the next release. If anyone
> is unfamiliar with MAF (Multiple Alignment Format), it is the file
> format used to store the eukaryote genome-to-genome multiple alignments
> produced by multiz.
>
> The exact specs are here:
> http://genome.ucsc.edu/FAQ/FAQformat.html#format5
>
> Some use cases are discussed in this paper, which implements in Galaxy
> (I believe) most of the same functionality as the MafIO class:
> http://www.ncbi.nlm.nih.gov/pubmed/21775304
>
> The branch of my biopython fork that contains the class:
> https://github.com/polyatail/biopython/tree/alignio-maf
>
> The class is implemented as a reader/writer compatible with the AlignIO
> API, but implements its own indexer (MafIO.MafIndex) based on
> SeqIO.index_db(). At the time, this seemed like the best way to
> implement this, as MAF is explicitly designed for genome-to-genome
> alignments while other formats are not. If we can assume a MAF file
> contains such an alignment, we can index it by genome coordinates and
> allow random access to intervals.
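As a rough illustration of the idea of indexing by genome coordinates, the
sketch below stores one row per alignment block in SQLite and looks up the
blocks that overlap a query interval. It is a simplified stand-in, not the
MafIndex implementation: the offset_data table, its column names and both
functions are assumptions, and coordinates are taken to be 0-based and
half-open.

```python
import sqlite3


def build_index(con, blocks):
    """Store (seqname, block_start, block_end, file_offset) rows.

    Coordinates are assumed to be 0-based, half-open intervals.
    """
    con.execute(
        "CREATE TABLE offset_data ("
        "seqname TEXT, block_start INTEGER, block_end INTEGER, "
        "file_offset INTEGER)"
    )
    con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", blocks)
    con.commit()


def search(con, seqname, start, end):
    """Return file offsets of blocks overlapping [start, end) on seqname."""
    rows = con.execute(
        "SELECT file_offset FROM offset_data "
        "WHERE seqname = ? AND block_start < ? AND block_end > ? "
        "ORDER BY block_start",
        (seqname, end, start),
    )
    return [offset for (offset,) in rows]
```

Given such a table, fetching an interval means one query for the overlapping
blocks followed by a seek to each returned offset in the MAF file, so the
whole file never has to be parsed.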
>
> This is especially useful since it is often desirable to retrieve the
> spliced multiple alignment of a multi-exonic transcript, which can be
> used to determine sequence conservation, construct a phylogenetic tree
> for a particular gene, or pull out orthologs of a large number of genes
> at once.
>
> The code consists of the reader, writer, and indexer classes in
> AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to
> the indexer in Tests/test_MafIO_index.py. I would really appreciate any
> feedback and suggestions, and if anyone has an opportunity to use this
> feature it would be great to get some feedback on its operation.
>
>
> Thanks!
> Andrew