[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython

Brad Chapman chapmanb at 50mail.com
Thu Jun 18 12:52:40 UTC 2009


Hi Eric;
Nice -- thanks much for the summary.

> *SeqRecord transformation*
> 
> It would be nice if I could round-trip this sequence information perfectly,
> so that nothing's lost between reading and writing an arbitrary, valid
> PhyloXML file. For that to work, PhyloXML.Sequence.from_seqrec() would need
> to look at SeqRecord.features and assume that any matching keys have the
> appropriate PhyloXML meaning.
> 
> These are the keys that from_seqrec() would look for:
>     location
>     uri
>     annotations
>     domain_architecture
> 
> Do you see any risk of collision for those names? And for serialization,
> would it be unwholesome to convert Annotation and DomainArchitecture objects
> to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." --
> it's another layer of parsing and kind of esoteric, but I can live with it.

SeqRecords have two places to store information related to the
sequence:

annotations -- key/value pairs describing the entire sequence,
  implemented as a dictionary with lists as values.
features -- items with a location that refer to part of the
  sequence, which can have key/value pairs, here called qualifiers.

My sense is that much of the PhyloXML markup will fit into
annotations. For instance, your annotation string should really be
part of the annotation dictionary:

{"ref" : ["foo"],
 "source" : ["bar"]
}

as opposed to a string that requires deserializing.
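
To make that concrete, here is a minimal sketch of what I mean; the keys
and values are made up, but the point is that the information goes straight
into the annotations dictionary and needs no extra parsing on the way back
out:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; the id and sequence are just placeholders.
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAAL"), id="example_id")

# Whole-sequence metadata lives in the annotations dictionary, with
# lists as values so several entries can coexist under one key.
record.annotations["ref"] = ["foo"]
record.annotations["source"] = ["bar"]

# Reading it back requires no deserializing step.
print(record.annotations["ref"])     # ['foo']
print(record.annotations["source"])  # ['bar']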

The easiest way to discuss this is to take a few real-life cases and
see how they fit, as Peter suggested. People here who are familiar with
using SeqRecords can hopefully come to a consensus on the best place
to store the different items.

> *Profiling*
> 
> Christian also suggested an option to parse just the phylogenies with a
> name or id matching a given string. I like that and I don't see any problem
> with extending it to clades as well. It seems like a reasonable use case to
> select a sub-tree from a complete phyloXML document and treat it as a
> separate phylogeny from then on. This can be supported by various methods
> for selecting portions of the tree, and a method on Clade for transforming
> the selection into a new Phylogeny instance (so the original can be safely
> deleted).
[...]
> About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just
> looking at Ubuntu's system monitor, and Firefox and a few other things were
> running at the same time, taking up about 800MB already. So the full NCBI
> taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and
> I think it will get smaller as I shrink down these PhyloXML classes.

Sounds great. I think you'll be fine with that memory usage and the
ability to select subsets based on an identifier.

> I did some profiling with the cProfile module, and it looks like most of the
> time is being spent instantiating Clade and Taxonomy objects. (Also,
> pretty_print is hugely inefficient, but that's less important.) I think I
> can speed up parsing and reduce memory usage by pulling the from_element
> methods out of each class and using a separate Parser class to do that work.
> 
> Questions:
>     - Do you know of a better way to profile Python code, or visualize it?
>     - Have you used __slots__ to optimize classes? Do you recommend it?

I use cProfile and pstats from the standard library, which it sounds
like you are already on top of. That normally points me to the right
places to try optimizations.
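
For reference, the pattern I use is roughly this; parse_phyloxml and the
file name are just placeholders standing in for whatever call you are
profiling:

import cProfile
import pstats

def parse_phyloxml(path):
    # stand-in for the real parsing call being profiled
    pass

# Run the call under the profiler and dump the statistics to a file.
cProfile.run("parse_phyloxml('ncbi_taxonomy.xml')", "parse_profile.stats")

# Show the 20 entries with the highest cumulative time.
stats = pstats.Stats("parse_profile.stats")
stats.sort_stats("cumulative").print_stats(20)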

I haven't used __slots__, but I generally try to avoid any Python black
magic. If people need additional CPU speedups, I'd suggest Psyco. It
increases memory usage, so it will be a tradeoff for most people.
Benchmarks with and without Psyco would give users a guideline for
deciding whether the optimization is worth it.
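
A rough way to get those numbers (again with a placeholder for the real
parsing call) is simply to time the same run with and without Psyco enabled:

import time

def parse_phyloxml(path):
    # stand-in for the real parsing call being benchmarked
    pass

start = time.time()
parse_phyloxml("ncbi_taxonomy.xml")
print("plain: %.1fs" % (time.time() - start))

try:
    import psyco
    psyco.full()  # let Psyco JIT-compile what it can
    start = time.time()
    parse_phyloxml("ncbi_taxonomy.xml")
    print("psyco: %.1fs" % (time.time() - start))
except ImportError:
    print("Psyco not installed; skipping the second run")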

> And a few that don't fit anywhere else:
> 
>     - What sort of whole-tree operations would you want to do with these
>       objects that you can't do with a Nexus or Newick tree? What other
>       formats would you want to convert to? I'm thinking of adding an
>       Export module later if there's time, for lossy conversions like a
>       graph for networkx.

This is a good general question for the users. I like the graph
conversion idea, as it avoids having to re-invent all of the graph
manipulation and query operations already present in networkx.
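
As a rough illustration (the clade attribute names .name and .clades are
just guesses at how your classes will end up looking, not a fixed API), the
lossy conversion itself could be a simple recursive walk:

import networkx

def clade_to_graph(clade, graph=None, parent=None):
    # Keep only the branching structure and clade names in the graph.
    if graph is None:
        graph = networkx.DiGraph()
    node = clade.name or id(clade)  # fall back to object id for unnamed clades
    graph.add_node(node)
    if parent is not None:
        graph.add_edge(parent, node)
    for child in clade.clades:  # assumed list of child clades
        clade_to_graph(child, graph, node)
    return graph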


>     - What's the most intuitive way to display a phylogenetic tree you've
>       loaded into Biopython? Serialize as Nexus and open in TreeViewX?
>       Convert to a graph and send to matplotlib? Or, is there a module in
>       Bio.Graphics that can draw trees? (If not, should there be?)

A good general way to do this would be welcome. I've used networkx
with pygraphviz to draw rough 'n ready trees before. Here is some
horribly non-generalized code that does this:

http://github.com/chapmanb/bcbb/blob/master/visualize/tax_data_display.py
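
The generic part of that script boils down to something like the sketch
below, with graph being a DiGraph such as the one from the conversion sketch
above. The graphviz 'dot' layout gives the tree-like arrangement but needs
pygraphviz, and the exact location of that layout helper has moved between
networkx versions, so treat this as approximate:

import matplotlib
matplotlib.use("Agg")  # render to a file instead of opening a window
import matplotlib.pyplot as plt
import networkx

def draw_tree(graph, out_file="tree.png"):
    try:
        pos = networkx.nx_agraph.graphviz_layout(graph, prog="dot")
    except ImportError:
        pos = networkx.spring_layout(graph)  # fallback without pygraphviz
    networkx.draw(graph, pos, with_labels=True, node_size=300, font_size=8)
    plt.savefig(out_file)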

Brad

> On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> 
> > Hi Eric;
> > Nice update and thanks again for copying the Biopython development
> > list on this.
> >
> > >  * Added to_seqrecord and from_seqrecord methods to the
> > >    PhyloXML.Sequence class -- getting Bio.SeqRecord to stand in for
> > >    PhyloXML.Sequence entirely will require some more thought
> >
> > I'm looking forward to seeing how you decide to go forward with
> > this. For the work I do on a day to day basis, a continual
> > struggle involves establishing relationships between things to
> > retrieve more information. For instance, a pair of nodes on a tree
> > is interesting -- how would I find papers, experiments and other
> > information associated with those sequences? It seems like Accession
> > and the ref attribute of Annotation help establish these
> > relationships.
> >
> > >  * Test-driven development kind of went out the window this week.
> >
> > Heh. It happens -- sounds sensible to have a clean up and
> > documentation week this week; that will also help others who are
> > interested dig into using it.
> >
> > >  * The unit tests I do have in place give some sense of memory and CPU
> > >    usage. For the full NCBI taxonomy, memory usage climbs up above 2 GB
> > >    with the read() function, which isn't a problem on this workstation
> > >    but could be for others.
> >
> > Do you see an opportunity to offer iterating over clades instead of
> > loading them all into memory for these larger trees? This would
> > involve lazily loading subclades on request and would limit some
> > functionality for querying the full tree without loading it all into
> > memory.
> >
> > Another option is to offer some pruning ability as a tree is
> > loading. For instance, I might be loading the whole NCBI taxonomy on a
> > memory-limited computer and only need the Angiosperm (flowering plant)
> > part of the tree. In that case, you'd want to throw away all clades
> > not under the clades of interest.
> >
> > These are probably fringe cases; just brainstorming some ideas.
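
For what it's worth, the general pattern I had in mind is the standard
ElementTree iterparse approach -- handle one subtree at a time and clear
whatever you don't need so memory stays bounded. A minimal sketch, with
unqualified placeholder tag names (a real phyloXML file is namespaced),
that keeps only phylogenies matching a given name:

from xml.etree import ElementTree

def iter_matching_phylogenies(xml_file, wanted_name):
    # Yield complete phylogeny elements whose <name> matches, freeing the rest.
    for event, elem in ElementTree.iterparse(xml_file, events=("end",)):
        if elem.tag == "phylogeny":
            if elem.findtext("name") == wanted_name:
                yield elem
            else:
                elem.clear()  # discard trees we don't need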
> >
> > Thanks again,
> > Brad
> >


