[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython
Diana Jaunzeikare
rozziite at gmail.com
Sun Jun 21 10:13:40 EDT 2009
2009/6/17 Eric Talevich <eric.talevich at gmail.com>
> Hi Brad,
>
> Here's a mid-week update and partial response to your questions.
>
> *SeqRecord transformation*
>
> It would be nice if I could round-trip this sequence information perfectly,
> so that nothing's lost between reading and writing an arbitrary, valid
> PhyloXML file. For that to work, PhyloXML.Sequence.from_seqrec() would need
> to look at SeqRecord.features and assume that any matching keys have the
> appropriate PhyloXML meaning.
>
> These are the keys that from_seqrec() would look for:
> location
> uri
> annotations
> domain_architecture
>
> Do you see any risk of collision for those names? And for serialization,
> would it be unwholesome to convert Annotation and DomainArchitecture objects
> to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." --
> it's another layer of parsing and kind of esoteric, but I can live with it.
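
For what it's worth, a rough sketch of the kind of GFF-style round trip being
described, assuming a flat dict of string values; the helper names are
placeholders, not anything in Bio.PhyloXML:

    def dict_to_gff_string(d):
        # Serialize a flat dict as a 'key=value;key=value' string,
        # e.g. {'ref': 'foo', 'source': 'bar'} -> 'ref=foo;source=bar'
        return ";".join("%s=%s" % (k, v) for k, v in sorted(d.items()))

    def gff_string_to_dict(s):
        # Parse a 'key=value;key=value' string back into a dict.
        return dict(item.split("=", 1) for item in s.split(";") if item)

    annotation = dict_to_gff_string({"ref": "foo", "source": "bar"})
    assert gff_string_to_dict(annotation) == {"ref": "foo", "source": "bar"}

Values containing ';' or '=' would need escaping, which is part of what makes
the extra parsing layer feel esoteric.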
>
>
> *Profiling*
>
> Christian also suggested an option to parse just the phylogenies with a
> name or id matching a given string. I like that and I don't see any problem
> with extending it to clades as well. It seems like a reasonable use case to
> select a sub-tree from a complete phyloXML document and treat it as a
> separate phylogeny from then on. This can be supported by various methods
> for selecting portions of the tree, and a method on Clade for transforming
> the selection into a new Phylogeny instance (so the original can be safely
> deleted).
>
I like this idea. I will do the same for the PhyloXML implementation in BioRuby.
Diana
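
A minimal sketch of how that selection-and-promotion API could look; the
class and method names here are illustrative only, not the actual
Bio.PhyloXML interface:

    class Phylogeny(object):
        def __init__(self, root, name=None):
            self.root = root
            self.name = name

    class Clade(object):
        def __init__(self, name=None, clades=None):
            self.name = name
            self.clades = clades or []

        def find_clades(self, name=None):
            # Yield this clade and any descendants whose name matches.
            if name is None or self.name == name:
                yield self
            for sub in self.clades:
                for match in sub.find_clades(name=name):
                    yield match

        def to_phylogeny(self, name=None):
            # Promote this clade to a standalone Phylogeny rooted here,
            # so the rest of the original document can be discarded.
            return Phylogeny(root=self, name=name)

    # e.g. pull out the subtree rooted at a clade named 'Alveolata':
    # subtree = next(tree.root.find_clades(name='Alveolata')).to_phylogeny()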
>
> I did some profiling with the cProfile module, and it looks like most of
> the time is being spent instantiating Clade and Taxonomy objects. (Also,
> pretty_print is hugely inefficient, but that's less important.) I think I
> can speed up parsing and reduce memory usage by pulling the from_element
> methods out of each class and using a separate Parser class to do that
> work.
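
For reference, profiling a single parse with cProfile looks something like
this; the read() call and file name are placeholders for whatever entry
point and test file is being measured:

    import cProfile
    import pstats

    # Assumes the module being timed (here called PhyloXML) is already
    # imported. Dump the stats to a file, then print the 20 most
    # expensive calls by cumulative time.
    cProfile.run("PhyloXML.read('ncbi_taxonomy.xml')", "parse.prof")
    pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)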
>
> About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was
> just looking at Ubuntu's system monitor, and Firefox and a few other things
> were running at the same time, taking up about 800MB already. So the full
> NCBI taxonomy actually takes up only 1.2GB or so, which isn't such a
> problem, and I think it will get smaller as I shrink down these PhyloXML
> classes.
>
> Questions:
> - Do you know of a better way to profile Python code, or visualize it?
> - Have you used __slots__ to optimize classes? Do you recommend it?
>
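On __slots__, the idea would be something like the following (the attribute
names are chosen just for illustration); it trades away the per-instance
__dict__ for a noticeably smaller footprint when millions of Clade or
Taxonomy objects are created:

    class Taxonomy(object):
        # __slots__ replaces the per-instance __dict__ with fixed slots,
        # so each object stores only these attributes and nothing else.
        __slots__ = ("code", "scientific_name", "rank")

        def __init__(self, code=None, scientific_name=None, rank=None):
            self.code = code
            self.scientific_name = scientific_name
            self.rank = rank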
> And a few that don't fit anywhere else:
>
>    - What sort of whole-tree operations would you want to do with these
>      objects that you can't do with a Nexus or Newick tree? What other
>      formats would you want to convert to? I'm thinking of adding an
>      Export module later if there's time, for lossy conversions like a
>      graph for networkx. (A rough conversion sketch follows this list.)
>
>    - What's the most intuitive way to display a phylogenetic tree you've
>      loaded into Biopython? Serialize as Nexus and open in TreeViewX?
>      Convert to a graph and send to matplotlib? Or, is there a module in
>      Bio.Graphics that can draw trees? (If not, should there be?)
>
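As a rough illustration of the lossy networkx export idea, assuming a clade
object that exposes .name and .clades and that clade names are unique:

    import networkx

    def clade_to_graph(clade, graph=None):
        # Recursively add parent -> child edges; only the topology and
        # clade names survive, everything else is dropped.
        if graph is None:
            graph = networkx.DiGraph()
        for child in clade.clades:
            graph.add_edge(clade.name, child.name)
            clade_to_graph(child, graph)
        return graph

A graph built this way could then be laid out and drawn with networkx's
matplotlib helpers, which would also be one quick answer to the display
question.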
> Thanks,
> Eric
>
>
>
> On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
>> Hi Eric;
>> Nice update and thanks again for copying the Biopython development
>> list on this.
>>
>> > * Added to_seqrecord and from_seqrecord methods to the
>> >   PhyloXML.Sequence class -- getting Bio.SeqRecord to stand in for
>> >   PhyloXML.Sequence entirely will require some more thought
>>
>> I'm looking forward to seeing how you decide to go forward with
>> this. For the work I do on a day-to-day basis, a continual
>> struggle involves establishing relationships between things to
>> retrieve more information. For instance, a pair of nodes on a tree
>> is interesting -- how would I find papers, experiments and other
>> information associated with those sequences? It seems like Accession
>> and the ref attribute of Annotation help establish these
>> relationships.
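
One concrete way to follow an Accession out to the literature would be
through Bio.Entrez; a sketch, using an example RefSeq accession and assuming
the accession refers to a protein record:

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

    accession = "NP_000549"                  # example accession value
    # Resolve the accession to an NCBI UID, then ask for linked PubMed records.
    handle = Entrez.esearch(db="protein", term=accession)
    uids = Entrez.read(handle)["IdList"]
    if uids:
        link_handle = Entrez.elink(dbfrom="protein", db="pubmed", id=uids[0])
        links = Entrez.read(link_handle)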
>>
>> > * Test-driven development kind of went out the window this week.
>>
>> Heh. It happens -- sounds sensible to have a clean up and
>> documentation week this week; that will also help others who are
>> interested dig into using it.
>>
>> > * The unit tests I do have in place give some sense of memory and CPU
>> >   usage. For the full NCBI taxonomy, memory usage climbs up above 2 GB
>> >   with the read() function, which isn't a problem on this workstation
>> >   but could be for others.
>>
>> Do you see an opportunity to offer iterating over clades instead of
>> loading them all into memory for these larger trees? This would
>> involve lazily loading subclades on request and would limit some
>> functionality for querying the full tree without loading it all into
>> memory.
>>
>> Another option is to offer some pruning ability while a tree is being
>> loaded. For instance, suppose I am loading the whole NCBI taxonomy on a
>> memory-limited computer and only need the angiosperm (flowering plant)
>> part of the tree. In this case, you'd want to throw away all clades
>> not under the clades of interest.
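
Both of those suggestions could probably be served by ElementTree's
iterparse, something along these lines (the namespace string and element
names follow the phyloXML spec; the function itself is only a sketch):

    from xml.etree import ElementTree

    NS = "{http://www.phyloxml.org}"

    def iter_clade_names(source):
        # Walk the file one element at a time instead of building the whole
        # tree; each clade's contents are cleared once its end tag is seen,
        # so most of the document is never held in memory at once.
        for event, elem in ElementTree.iterparse(source, events=("end",)):
            if elem.tag == NS + "clade":
                name = elem.findtext(NS + "name")
                if name is not None:
                    yield name
                elem.clear()

Keeping just one subtree (the angiosperm case) would need a little more
bookkeeping -- track whether the parser is currently inside the wanted clade
and only clear elements outside it -- but the same streaming approach applies.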
>>
>> These are probably fringe cases; just brainstorming some ideas.
>>
>> Thanks again,
>> Brad
>>
>
>