[BioRuby] BioRuby Phyloxml update

Tue Nov 17 14:52:59 UTC 2009

Thanks for the discussion. I see Naohisa's point that it is difficult
to keep consistency when copying a tree.
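
To make the copying problem concrete, here is a minimal sketch. SketchNode
and SketchTree are hypothetical classes made up for illustration, not the
Bio::Tree API; the only assumption is that a copied tree gets its own node
list while the node objects themselves stay shared, as with a shallow "dup".

```ruby
# Hypothetical classes for illustration only; this is not the Bio::Tree API.
class SketchNode
  attr_reader :name
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil
  end
end

class SketchTree
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Register both nodes in this tree and record the parent on the child.
  def add_child(parent, child)
    @nodes << parent unless @nodes.include?(parent)
    @nodes << child unless @nodes.include?(child)
    child.parent = parent
  end

  # dup/clone copy the node list, but the node objects stay shared.
  def initialize_copy(other)
    super
    @nodes = other.nodes.dup
  end
end

root = SketchNode.new("root")
leaf = SketchNode.new("leaf")
original = SketchTree.new
original.add_child(root, leaf)

copy = original.dup
copy.add_child(SketchNode.new("new_root"), leaf) # re-parent in the copy only

# The leaf is the same object in both trees, so the original is affected too.
puts original.nodes.last.parent.name # prints "new_root"
```

Once a parent reference lives inside the node, editing the copy silently
changes what the original tree reports, unless copying also duplicates
every node.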

Right now the PhyloXML::Tree class inherits from Bio::Tree. Instead, I
could write a new general Bio::FamilyTree class (per Pjotr's
suggestion), which would be strictly a tree (I believe that Bio::Tree
allows a node to have two parents) and would store parent/child
information in its nodes. It would not need an underlying general graph
implementation, which would make it simpler than Bio::Tree. Then
PhyloXML::Tree would inherit from Bio::FamilyTree. The PhyloXML writer
would probably be even faster this way, because it would not need to
update the Bio::Pathway structure (which underlies Bio::Tree) every
time a node or edge is added.
Additionally, I think BioRuby would benefit from a general
Bio::FamilyTree class. I recently heard a talk by a researcher who did
a phylogenetic analysis of musical rhythms.
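
As a rough sketch of what I have in mind (the class name follows the
proposal, but every method name here is an assumption, nothing is written
yet): each node keeps a reference to its single parent and an array of its
children, so no graph search is needed to find either.

```ruby
# Hypothetical sketch of a strict tree. FamilyTree and all of its methods
# are assumptions for illustration, not an existing BioRuby class.
class FamilyTree
  class Node
    attr_reader :name, :parent, :children

    def initialize(name)
      @name = name
      @parent = nil
      @children = []
    end

    # Attach child under this node. The child is detached from any previous
    # parent first, so a node never has more than one parent.
    def add_child(child)
      if child.equal?(self) || ancestors.include?(child)
        raise ArgumentError, "adding #{child.name} would create a cycle"
      end
      child.parent.children.delete(child) if child.parent
      child.parent = self
      @children << child
      child
    end

    # Walk the parent references up to the root.
    def ancestors
      result = []
      node = parent
      while node
        result << node
        node = node.parent
      end
      result
    end

    protected

    attr_writer :parent
  end

  attr_reader :root

  def initialize(root)
    @root = root
  end

  # Pre-order traversal that follows the stored children arrays directly.
  def each_node(node = @root, &block)
    return enum_for(:each_node, node) unless block
    block.call(node)
    node.children.each { |child| each_node(child, &block) }
  end
end

root = FamilyTree::Node.new("root")
a = root.add_child(FamilyTree::Node.new("A"))
b = root.add_child(FamilyTree::Node.new("B"))
c = a.add_child(FamilyTree::Node.new("C"))
b.add_child(c) # moves C from A to B instead of giving it two parents

tree = FamilyTree.new(root)
puts tree.each_node.map(&:name).join(" ") # prints "root A B C"
```

Looking up a parent is then a single attribute read, and a writer can walk
the children arrays without touching a separate graph structure.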

I will also write a method to convert from Newick to PhyloXML.
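
A very rough sketch of the idea, not the planned BioRuby method: the
helper names below (Clade, parse_clade, clade_to_xml, newick_to_phyloxml)
are made up, the parser only handles unquoted names and optional branch
lengths, and marking the phylogeny as rooted is an assumption. It uses
only the standard library rather than the existing Newick or PhyloXML
classes.

```ruby
require "strscan"

# Hypothetical helpers for illustration; not part of BioRuby.
Clade = Struct.new(:name, :branch_length, :children)

# Recursive-descent parse of one Newick clade: "(child,child)name:length".
def parse_clade(scanner)
  scanner.skip(/\s*/)
  children = []
  if scanner.skip(/\(/)
    loop do
      children << parse_clade(scanner)
      scanner.skip(/\s*/)
      break unless scanner.skip(/,/)
    end
    scanner.skip(/\)/) or raise ArgumentError, "expected ')' at #{scanner.pos}"
  end
  name = scanner.scan(/[^(),:;\s]+/)
  length = scanner.skip(/:/) ? Float(scanner.scan(/[-+0-9.eE]+/)) : nil
  Clade.new(name, length, children)
end

# Escape the characters that are not allowed in XML text content.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Emit nested <clade> elements with optional <name> and <branch_length>.
def clade_to_xml(clade, indent = 1)
  pad = "  " * indent
  lines = ["#{pad}<clade>"]
  lines << "#{pad}  <name>#{xml_escape(clade.name)}</name>" if clade.name
  if clade.branch_length
    lines << "#{pad}  <branch_length>#{clade.branch_length}</branch_length>"
  end
  clade.children.each { |child| lines << clade_to_xml(child, indent + 1) }
  lines << "#{pad}</clade>"
  lines.join("\n")
end

def newick_to_phyloxml(newick)
  root = parse_clade(StringScanner.new(newick.strip))
  [
    '<phyloxml xmlns="http://www.phyloxml.org">',
    '<phylogeny rooted="true">', # assumption: treat the tree as rooted
    clade_to_xml(root),
    "</phylogeny>",
    "</phyloxml>"
  ].join("\n")
end

puts newick_to_phyloxml("(A:0.1,B:0.2)C;")
```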

What do you think?

Cheers,
Diana

On Mon, Nov 16, 2009 at 5:11 AM, Jan Aerts <jan.aerts at gmail.com> wrote:
> All,
>
> I think we should make a good effort to merge Diana's code into the
> bioruby codebase. Even though I'm not completely familiar with
> bioruby's phylo implementation, an effort like hers should be welcomed
> with open arms.
>
> If her code speeds things up so immensely, why don't we start a new
> branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
> A major new release is allowed to break free from the legacy
> code.
>
> We definitely don't want Diana's efforts to be in vain.
>
> jan.
>
> 2009/11/8 Naohisa Goto <ngoto at gen-info.osaka-u.ac.jp>:
>> Hi Diana,
>>
>> I'm sorry that the changes cannot be accepted, because the
>> modification of existing Bio::Tree methods breaks things.
>> Bio::Tree deliberately does not keep children/parent information
>> in nodes. One of the reasons is that it is difficult to keep
>> consistency when copying a tree. Nodes can be shared between two
>> or more trees when a tree is copied with the "dup" or "clone"
>> method.
>>
>> Normally, tests for existing classes should not be modified
>> except when the specification changes or the test itself has a bug,
>> because they guarantee the specification of the class. Adding new
>> tests is OK.
>>
>> If you really want each node to hold parent/children information,
>> please do so only in the PhyloXML classes (though I'm against it).
>> In that case, reading PhyloXML data and writing it back seems fine,
>> but there is currently no way to convert a Bio::Tree to PhyloXML,
>> so for now it seems hard to convert Newick data to PhyloXML.
>>
>> Now, to prepare to include your PhyloXML code in BioRuby, I'm working
>> on my branch. Some API changes will be made.
>> http://github.com/ngoto/bioruby/tree/incoming
>>
>> Note that in your test code, the argument order of assert_equal is
>> wrong. I've already fixed it in my branch.
>> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
>>
>>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly. Have I
>>> forgotten anything?
>>
>> Changing root with tree.root=().
>>
>> --
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>
>>
>>> Hi all,
>>>
>>> I have finally updated the Bio::Tree and Bio::Node classes to
>>> improve the speed of the PhyloXML writer.
>>>
>>> * Added Bio::Node::parent and Bio::Node::children (array of nodes) in
>>> order to avoid calling Tree::parent(node) or Tree::children(node),
>>> because those methods run a breadth-first search on the underlying
>>> graph, which makes the PhyloXML writer and parser incredibly slow. In
>>> contrast, Bio::Node::parent and Bio::Node::children keep references
>>> to the respective nodes.
>>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly. Have I
>>> forgotten anything?
>>> * The PhyloXML writer now takes less than 1 second instead of
>>> ~20 minutes to write the 1.5 MB ncbi_taxonomy_mollusca.xml file.
>>> * Writing the tree of life taxonomy file (~46 MB) takes 10 seconds
>>> (on a 2.4 GHz machine with 2.9 GB RAM, running Ubuntu).
>>>
>>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>>
>>> I wrote unit tests for my changes and made sure my changes don't break
>>> anything else. However, does anybody have code lying around that uses
>>> the Tree::parent and Tree::children methods so that I can test it more
>>> thoroughly?
>>>
>>> Cheers,
>>> Diana
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioruby