[Biopython-dev] MAF Parser/Indexer

Sun Apr 1 20:30:26 UTC 2012

Andrew;
Thanks for putting this together. It looks great, it is well integrated
with AlignIO, and it's awesome to see a test suite.

I dug through the code and my small suggestions would be:

- Could you refactor some of the larger functions into smaller, separate
  components? A couple of these are spread over a ton of lines, and it
  can be a bit difficult to follow the logic throughout:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L172
  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L399

  As a practical example, here you have a large block which checks the
  SQLite index matches the MAF file and everything looks okay:

  https://github.com/polyatail/biopython/blob/alignio-maf/Bio/AlignIO/MafIO.py#L199

  This would be clearer if factored into something like:

  if os.path.isfile(sqlite_file):
      try:
          self._record_count = self._verify_record_count(con)
      except ...

- Would you be able to put together a small example for the
  Cookbook or Tutorial documentation? This would be a great way to help
  others get started with the functionality and advertise it.
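
To make the first suggestion concrete, here is a minimal, self-contained
sketch of one such check pulled out into its own helper. It is not taken
from the branch: the table names (meta_data, offset_data), the
'record_count' key, the exception class, and both function names are
assumptions made purely for illustration.

```python
import sqlite3


class IndexMismatchError(ValueError):
    """Raised when the index contents disagree with the stored metadata."""


def verify_record_count(con):
    """Return the stored record count after checking it against the rows.

    Assumes a hypothetical schema: meta_data(key, value) holds a
    'record_count' entry and offset_data holds one row per MAF block.
    """
    row = con.execute(
        "SELECT value FROM meta_data WHERE key = 'record_count'"
    ).fetchone()
    if row is None:
        raise IndexMismatchError("index has no record_count entry")
    expected = int(row[0])
    (actual,) = con.execute("SELECT COUNT(*) FROM offset_data").fetchone()
    if actual != expected:
        raise IndexMismatchError(
            "expected %d records, found %d" % (expected, actual)
        )
    return expected


def build_demo_index(n_rows, claimed_count):
    """Build an in-memory index with the hypothetical schema, for testing."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE meta_data (key TEXT, value TEXT)")
    con.execute(
        "CREATE TABLE offset_data "
        "(start_pos INTEGER, end_pos INTEGER, file_offset INTEGER)"
    )
    con.execute(
        "INSERT INTO meta_data VALUES ('record_count', ?)",
        (str(claimed_count),),
    )
    con.executemany(
        "INSERT INTO offset_data VALUES (?, ?, ?)",
        [(i * 10, i * 10 + 10, i * 100) for i in range(n_rows)],
    )
    return con


if __name__ == "__main__":
    good = build_demo_index(3, 3)
    print(verify_record_count(good))
    stale = build_demo_index(2, 5)
    try:
        verify_record_count(stale)
    except IndexMismatchError as err:
        print("index rejected:", err)
```

With each check in a small helper like this, the constructor only has to
call the helpers in turn and decide what to do when one of them raises,
and each helper can be unit tested against a throwaway in-memory database.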

Thanks again for this,
Brad

> Hi all,
> 
> I would like to start a discussion about what is needed to make the 
> AlignIO.MafIO parser and indexer ready for the next release. If anyone 
> is unfamiliar with MAF (Multiple Alignment Format), it is the file 
> format used to store the eukaryote genome-to-genome multiple alignments 
> produced by multiz.
> 
> The exact specs are here:
>    http://genome.ucsc.edu/FAQ/FAQformat.html#format5
> 
> Some use cases are discussed in this paper, which implements (I believe) 
> most of the same functionality as the MafIO class, but within Galaxy:
>    http://www.ncbi.nlm.nih.gov/pubmed/21775304
> 
> The branch of my biopython fork that contains the class:
>    https://github.com/polyatail/biopython/tree/alignio-maf
> 
> The class is implemented as a reader/writer compatible with the AlignIO 
> API, but implements its own indexer (MafIO.MafIndex) based on 
> SeqIO.index_db(). At the time, this seemed like the best way to 
> implement this, as MAF is explicitly designed for genome-to-genome 
> alignments while other formats are not. If we can assume a MAF file 
> contains such an alignment, we can index it by genome coordinates and 
> allow random access to intervals.
> 
> This is especially useful since it is often desirable to retrieve the 
> spliced multiple alignment of a multi-exonic transcript, which can be 
> used to determine sequence conservation, construct a phylogenetic tree 
> for a particular gene, or pull out orthologs of a large number of genes 
> at once.
> 
> The code consists of the reader, writer, and indexer classes in 
> AlignIO/MafIO.py, test files in Tests/MAF, and unit tests specific to 
> the indexer in Tests/test_MafIO_index.py. I would really appreciate any 
> feedback and suggestions, and if anyone has an opportunity to use this 
> feature it would be great to get some feedback on its operation.
> 
> 
> Thanks!
> Andrew