[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython

Wed Aug 5 22:31:31 UTC 2009

On Mon, Aug 3, 2009 at 6:38 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Hi Eric;
> Thanks for the update. Things are looking in great shape as we get
> towards the home stretch.
>
> >     - Most of the work done this week and last, shuffling base classes and
> >       adding various checks, actually made the I/O functions a little slower.
> >       I don't think this will be a big deal, and the changes were necessary,
> >       but it's still a little disappointing.
>
> The unfortunate influence of generalization. I think the adjustment
> to the generalized Tree is a big win and gives a solid framework for
> any future phylogenetic modules. I don't know what the numbers are
> but as long as performance is reasonable, few people will complain.
> This is always something to go back around on if it becomes a hangup
> in the future.
>

The complete unit test suite used to take about 4.5 seconds, and now it
takes 5.8 seconds (roughly 30% longer), though I've added a few more tests
since then, so not all of that is slowdown. I don't think anything will feel
like it's hanging, except maybe parsing or searching a huge tree.
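
A rough sketch of how a parse-time figure like this could be measured in
isolation, using only the standard library. It does not use the Bio.TreeIO
parser: make_tree_xml() and count_clades() are hypothetical stand-ins, and the
synthetic document omits the PhyloXML namespace to keep the tag check short.

```python
import timeit
import xml.etree.ElementTree as ET
from io import BytesIO


def make_tree_xml(n_leaves):
    """Build a PhyloXML-like document: one root clade holding n_leaves leaf clades."""
    leaves = "".join(
        "<clade><name>taxon%d</name><branch_length>0.1</branch_length></clade>" % i
        for i in range(n_leaves)
    )
    doc = "<phyloxml><phylogeny><clade>%s</clade></phylogeny></phyloxml>" % leaves
    return doc.encode("utf-8")


def count_clades(xml_bytes):
    """Parse incrementally and count <clade> elements, freeing each one when done."""
    count = 0
    for _event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "clade":
            count += 1
            elem.clear()  # drop children and text so memory stays flat
    return count


data = make_tree_xml(5000)
# Take the best of three single runs to reduce noise from other processes.
seconds = min(timeit.repeat(lambda: count_clades(data), number=1, repeat=3))
print("parsed %d clades in %.3f s" % (count_clades(data), seconds))
```

Running the same measurement before and after a refactoring, on the same
input, separates real parser slowdown from time added by new tests.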

> >     - The networkx export will look pretty cool. After exporting a Biopython
> >       tree to a networkx graph, it takes a couple more imports and commands to
> >       draw the tree to the screen or a file. Would anyone find it handy to have
> >       a short function in Bio.Tree or Bio.Graphics to go straight from a tree
> >       to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)
>
> Awesome. Looking forward to seeing some trees that come out of this.
> It's definitely worthwhile to formalize the functionality to go
> straight from a tree to png or pdf. This will add some more
> localized dependencies, so I'm torn as to whether it would be best
> as a utility function or an example script. Peter might have an
> opinion here.
>
> Either way, this would be really useful as a cookbook example with a
> final figure. Being able to produce some pretty pictures is a good way to
> convince people to store trees in a reasonable format like PhyloXML.
>

OK, it works now, but the resulting trees look a little odd. The options
needed to get a reasonable tree representation are fiddly, so I made
draw_graphviz() a separate function that basically just handles the RTFM
work of picking those options (not trivial), while the graph export still
happens in to_networkx().

Here are a few recipes and a taste of each dish. The matplotlib engine seems
usable for interactive exploration, albeit cluttered -- I can't hide the
internal clade identifiers since graphviz needs unique labels, though maybe
I could make them less prominent. Drawing directly to PDF gets cluttered for
big files, and if you stray from the default settings (I played with it a
bit to get it right), it can look surreal. There would still be some benefit
to having a reportlab-based tree module in Bio.Graphics, and maybe one day
I'll get around to that.

$ ipython -pylab
from Bio import Tree, TreeIO
apaf = TreeIO.read('apaf.xml', 'phyloxml')

Tree.draw_graphviz(apaf)
# http://etal.myweb.uga.edu/phylo-nx-apaf.png

Tree.draw_graphviz(apaf, 'apaf.pdf')
# http://etal.myweb.uga.edu/apaf.pdf

Tree.draw_graphviz(apaf, 'apaf.png', format='png', prog='dot')
# http://etal.myweb.uga.edu/apaf.png -- why it's best to leave the defaults alone

Thoughts: the internal node labels could be clear instead of red; if a node
doesn't have a name, it could check its taxonomy attribute to see if
anything's there; there's probably a way to make pygraphviz understand
distinct nodes that happen to have the same label, although I haven't found
it yet. Is PDF a good default format, or would PNG or PostScript be better?
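
On the duplicate-label question: in the Graphviz DOT language itself, a node's
ID is separate from its "label" attribute, so distinct nodes can share a label
or have an empty one. The sketch below emits plain DOT text to show the idea;
it is not the networkx/pygraphviz path that draw_graphviz() uses, and the
nested-dict tree layout, tree_to_dot(), and the name/taxonomy fallback are
assumptions made for illustration.

```python
import itertools


def tree_to_dot(tree):
    """Return Graphviz DOT text for a nested-dict tree.

    Every node gets a generated unique ID (n0, n1, ...). The visible text goes
    in the separate 'label' attribute, so labels may repeat or be empty.
    """
    counter = itertools.count()
    lines = ["graph tree {"]

    def display_label(node):
        # Fall back from name to taxonomy, then to an empty (invisible) label.
        return node.get("name") or node.get("taxonomy") or ""

    def visit(node):
        node_id = "n%d" % next(counter)
        # Escape backslashes first, then double quotes, for a quoted DOT string.
        label = display_label(node).replace("\\", "\\\\").replace('"', '\\"')
        lines.append('  %s [label="%s"];' % (node_id, label))
        for child in node.get("clades", []):
            child_id = visit(child)
            lines.append("  %s -- %s;" % (node_id, child_id))
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Two different leaves are both named "A"; internal clades have no name at all.
example = {
    "clades": [
        {"name": "A"},
        {"clades": [{"name": "A"}, {"taxonomy": "Homo sapiens"}]},
    ]
}
print(tree_to_dot(example))
```

The resulting text could be saved to a file and rendered with the Graphviz
command-line tools; whether the same ID/label split can be reached through the
networkx export is a separate question this sketch does not answer.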

> >     - I have to admit this: I don't know anything about BioSQL. How would I use
> >       and test the PhyloDB extension, and what's involved in writing a
> >       Biopython interface for it?
>
> BioSQL and the PhyloDB extension are a set of relational database
> tables. Looking at the SVN logs, it appears as if the main work on
> PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
> lagging behind, so my suggestion is to start with PostgreSQL.
> Hilmar, please feel free to correct me here.
>

> [...]
>

> So it's a bit of an extended task. Time frames being what they are,
> any steps in this direction are useful. If you haven't played with
> BioSQL before, it's worth a look for your own interest. The underlying
> key/value model is really flexible and kind of models RDF triples. I've
> used BioSQL here recently as the backend for a web app that differs a
> bit from the standard GenBank-like thing, and found it very flexible.
>
>
I think I've seen that app, but I thought it was backed by AppEngine. Neat
stuff. I will learn BioSQL for my own benefit, but I don't think there's
enough time left in GSoC for me to add a useful PhyloDB adapter to
Biopython. So that, along with refactoring Nexus.Trees to use
Bio.Tree.BaseTree, would be a good project to continue with in the fall, at
a slower pace and with more discussion along the way.
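
As a rough illustration of the key/value idea mentioned above, here is a
minimal sketch using sqlite3 from the standard library. The table, column
names, and rows are made up for the example; this is not the BioSQL or PhyloDB
schema, only the general "one row per (subject, key, value)" pattern that
resembles RDF triples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table, NOT the BioSQL/PhyloDB schema: one row per annotation.
conn.execute("CREATE TABLE annotation (node_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        (1, "name", "leaf A"),            # illustrative rows only
        (1, "taxonomy", "Homo sapiens"),
        (2, "name", "leaf B"),
        (2, "confidence", "0.95"),
    ],
)


def annotations_for(node_id):
    """Collect every key/value pair attached to one node into a dict."""
    rows = conn.execute(
        "SELECT key, value FROM annotation WHERE node_id = ? ORDER BY key",
        (node_id,),
    )
    return dict(rows.fetchall())


print(annotations_for(2))
```

New kinds of annotation need no schema change, only new key strings, which is
where the flexibility comes from; the cost is that values are untyped text.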

Cheers,
Eric