[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update 11: PhyloXML for Biopython

Mon Aug 3 22:38:47 UTC 2009

Hi Eric;
Thanks for the update. Things are looking in great shape as we get
towards the home stretch.

>     - Most of the work done this week and last, shuffling base classes and
>       adding various checks, actually made the I/O functions a little slower.
>       I don't think this will be a big deal, and the changes were necessary,
>       but it's still a little disappointing.

The unfortunate influence of generalization. I think the adjustment
to the generalized Tree is a big win and gives a solid framework for
any future phylogenetic modules. I don't know what the numbers are
but as long as performance is reasonable, few people will complain.
This is always something to go back around on if it becomes a hangup
in the future.

>     - The networkx export will look pretty cool. After exporting a Biopython
>       tree to a networkx graph, it takes a couple more imports and commands to
>       draw the tree to the screen or a file. Would anyone find it handy to have
>       a short function in Bio.Tree or Bio.Graphics to go straight from a tree
>       to a PNG or PDF? (Dependencies: networkx, matplotlib or maybe graphviz)

Awesome. Looking forward to seeing some trees that come out of this.
It's definitely worthwhile to formalize the functionality to go
straight from a tree to png or pdf. This will add some more
localized dependencies, so I'm torn as to whether it would be best
as a utility function or an example script. Peter might have an
opinion here.

Either way, this would be really useful as a cookbook example with a
final figure. Being able to produce some pretty figures is a good way
to convince people to store trees in a reasonable format like PhyloXML.
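
To make the dependency question concrete, here is a rough sketch of
what such a helper could look like. It is not the Bio.Tree API: the
nested (label, [children]) tuples and the names tree_to_edges and
draw_tree are made up for illustration. The networkx and matplotlib
imports live inside the drawing function, so they are only needed by
people who actually draw.

```python
def tree_to_edges(tree, parent=None):
    """Flatten a nested (label, [children]) tuple into (parent, child) pairs."""
    label, children = tree
    edges = [] if parent is None else [(parent, label)]
    for child in children:
        edges.extend(tree_to_edges(child, label))
    return edges


def draw_tree(tree, path):
    """Draw the tree to `path`; the extension (.png, .pdf) picks the format."""
    # Optional dependencies are imported here, not at module level,
    # so the rest of the module works without them installed.
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display required
    import matplotlib.pyplot as plt
    import networkx as nx

    graph = nx.Graph()
    graph.add_edges_from(tree_to_edges(tree))
    plt.figure()
    nx.draw(graph, with_labels=True)
    plt.savefig(path)
    plt.close()


# Hypothetical toy tree: a root with one leaf and one inner node.
toy = ("root", [("A", []), ("inner", [("B", []), ("C", [])])])
edges = tree_to_edges(toy)
# draw_tree(toy, "toy.png")  # needs networkx and matplotlib
```

The same function body would work as an example script if the extra
dependencies turn out to be unwelcome in the library itself.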

>     - I have to admit this: I don't know anything about BioSQL. How would I use
>       and test the PhyloDB extension, and what's involved in writing a
>       Biopython interface for it?

BioSQL and the PhyloDB extension are a set of relational database
tables. Looking at the SVN logs, it appears as if the main work on
PhyloDB has occurred on PostgreSQL with the MySQL tables perhaps
lagging behind, so my suggestion is to start with PostgreSQL.
Hilmar, please feel free to correct me here.

The schemas are available from SVN:

http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/sql

You'd want biosqldb-pg.sql and presumably also biosqldb-views-pg.sql
for BioSQL and biosql-phylodb-pg.sql and biosql-phylodata-pg.sql.
The Biopython docs are pretty nice on this -- you create the empty tables:

http://biopython.org/wiki/BioSQL#PostgreSQL

From there you should be able to browse to get a sense of what is
there. In terms of writing an interface, the first step is loading
the data where you can mimic what is done with SeqIO and BioSQL:

http://biopython.org/wiki/BioSQL#Loading_Sequences_into_a_database

Pass the database an iterator of trees and they are stored.
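
As a minimal sketch of that loading step, assuming a much simplified
schema: the two tables below are hypothetical stand-ins, not the real
PhyloDB tables, and sqlite3 is used only so the example runs without
a PostgreSQL server. The load function and the (name, root) tuples
are also made up for illustration.

```python
import sqlite3


def create_tables(conn):
    # Hypothetical, simplified tables; the real PhyloDB schema differs.
    conn.executescript("""
        CREATE TABLE tree (tree_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE node (
            node_id INTEGER PRIMARY KEY,
            tree_id INTEGER NOT NULL REFERENCES tree(tree_id),
            parent_id INTEGER REFERENCES node(node_id),
            label TEXT
        );
    """)


def _store_node(conn, tree_id, node, parent_id):
    """Insert one node, then recurse into its children."""
    label, children = node
    cur = conn.execute(
        "INSERT INTO node (tree_id, parent_id, label) VALUES (?, ?, ?)",
        (tree_id, parent_id, label),
    )
    for child in children:
        _store_node(conn, tree_id, child, cur.lastrowid)


def load(conn, trees):
    """Store every (name, root) pair from the iterator; return the count."""
    count = 0
    for name, root in trees:
        cur = conn.execute("INSERT INTO tree (name) VALUES (?)", (name,))
        _store_node(conn, cur.lastrowid, root, None)
        count += 1
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
create_tables(conn)
trees = iter([
    ("t1", ("root", [("A", []), ("B", [])])),
    ("t2", ("root", [("C", [])])),
])
loaded = load(conn, trees)
node_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
```

Because load only ever pulls the next tree from the iterator, a
parser that yields trees lazily never has to hold a whole file in
memory.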

The second step is retrieving and querying persisted trees. Here you
would want TreeDB objects that act like standard trees, but
retrieve information from the database on demand. Here are
Seq/SeqRecord models in BioSQL:

http://github.com/biopython/biopython/tree/master/BioSQL/BioSeq.py
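
The on-demand idea can be sketched roughly like this. DBNode and the
single node table are hypothetical and much simpler than either the
BioSeq.py classes or the PhyloDB schema; sqlite3 again stands in for
PostgreSQL so the example is self-contained.

```python
import sqlite3


class DBNode:
    """Node proxy: holds only an id, queries the database on first access."""

    def __init__(self, conn, node_id):
        self._conn = conn
        self.node_id = node_id
        self._cache = {}  # filled lazily, one entry per attribute

    @property
    def label(self):
        if "label" not in self._cache:
            row = self._conn.execute(
                "SELECT label FROM node WHERE node_id = ?", (self.node_id,)
            ).fetchone()
            self._cache["label"] = row[0]
        return self._cache["label"]

    @property
    def children(self):
        if "children" not in self._cache:
            rows = self._conn.execute(
                "SELECT node_id FROM node WHERE parent_id = ? ORDER BY node_id",
                (self.node_id,),
            ).fetchall()
            self._cache["children"] = [DBNode(self._conn, r[0]) for r in rows]
        return self._cache["children"]


# Hypothetical one-table layout with three nodes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (node_id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "A"), (3, 1, "B")],
)
root = DBNode(conn, 1)
child_labels = [child.label for child in root.children]
```

Nothing is read from the database until label or children is first
touched, and each value is fetched only once per object.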

So it's a bit of an extended task. Time frames being what they are,
any steps in this direction are useful. If you haven't played with
BioSQL before, it's worth a look for your own interest. The underlying
key/value model is really flexible and kind of models RDF triples. I've
used BioSQL here recently as the backend for a web app that differs a
bit from the standard GenBank-like thing, and found it very flexible.

Again, great stuff. Let me know if I can add to any of that,
Brad