[Biopython-dev] MAF Parser/Writer/Indexer

Peter Cock p.j.a.cock at googlemail.com
Mon May 16 11:14:05 UTC 2011


On Fri, May 13, 2011 at 10:26 PM, Andrew Sczesnak
<andrew.sczesnak at med.nyu.edu> wrote:
> Hi All,
>
> I'd like to contribute MAF parser/writer classes to Bio.AlignIO.  MAF is an
> alignment format used for whole genome alignments, as in the 30-way (or
> more) multiz alignments at UCSC:
>
> http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/maf/
>
> A description of the format is available here:
>
> http://genome.ucsc.edu/FAQ/FAQformat#format5
>

I started work on merging the basic parser/writer into Biopython
on this new branch,

https://github.com/peterjc/biopython/tree/alignio-maf

As I think I mentioned by email before, there were some PEP8
formatting changes (removing spaces before brackets).

Another little thing rather than MultipleSeqAlignment(alphabet)
you should use MultipleSeqAlignment([], alphabet) to create
an empty alignment. The former works with a deprecation
warning to help transition from the old alignment object.

Note that by hooking up "maf" in AlignIO as an output format,
it will get exercised by some of the unit tests, in particular
test_AlignIO.py - and that showed some problems.

On a functional level your code was not preserving the order
of the records within each alignment. By using a dictionary the
order becomes Python implementation specific, meaning it cannot
be assumed in unit tests (i.e. C Python vs Jython vs IronPython vs
PyPy could all store dictionary elements in a different order).
Also it was also breaking test_AlignIO.py, so I changed that.

Do you think we should follow the speciesOrder directive if
present?

Note that right now, test_AlignIO.py is still not passing (which
is a major reason why I haven't merged this to the trunk).
Currently the issue is to do with how you are parsing species
names, assuming database.chromosome is not possible in general.

Also I think we may need to do something rigorous with start/end
co-ordinates and strand in either the Seq or SeqRecord object.
They could be updated automatically during slicing and taking
reverse complement... they might not survive addition though.

Peter




More information about the Biopython-dev mailing list