[BioRuby] Update on phyloXML support for BioRuby project

Wed May 20 06:09:17 UTC 2009

Hi all,

On Tue, 19 May 2009 17:07:59 -0400
Diana Jaunzeikare <rozziite at gmail.com> wrote:

> So, I think we have reached consensus that the best choice is libxml2-ruby
> SAX based XML parser.

Within libxml2-ruby, I think LibXML::XML::Reader is the best choice,
because it is more memory efficient than DOM and its API is simpler
than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder
if SAX's callback-based API would make our code too complex and
difficult to maintain.
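
To illustrate the difference in API style, here is a rough sketch that
extracts clade names twice, once with callbacks and once with a pull loop.
It uses REXML from the Ruby distribution only as a stand-in so that it runs
without extra gems; LibXML::XML::Reader is cursor-based and its method names
differ. XML_SAMPLE, NameListener, names_by_callbacks and names_by_pulling
are names made up for this example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/pullparser'

# Tiny made-up sample, not a complete phyloXML document.
XML_SAMPLE = '<phylogeny><clade><name>A</name></clade>' \
             '<clade><name>B</name></clade></phylogeny>'

# Callback style: the parser drives, so parsing state has to be kept
# in instance variables shared between the callback methods.
class NameListener
  include REXML::StreamListener
  attr_reader :names

  def initialize
    @names = []
    @in_name = false
  end

  def tag_start(name, _attrs)
    @in_name = true if name == 'name'
  end

  def text(data)
    @names << data if @in_name
  end

  def tag_end(name)
    @in_name = false if name == 'name'
  end
end

def names_by_callbacks(xml)
  listener = NameListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.names
end

# Pull style: our own loop drives, so state can stay in local variables.
def names_by_pulling(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  names = []
  in_name = false
  while parser.has_next?
    event = parser.pull
    case event.event_type
    when :start_element then in_name = true if event[0] == 'name'
    when :text          then names << event[0] if in_name
    when :end_element   then in_name = false if event[0] == 'name'
    end
  end
  names
end
```

Both methods give the same names for XML_SAMPLE. For such a small task the
two are similar in size; my concern is about how the callback version grows
once nested clades and many element types have to be tracked.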

> Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems 
> logical that the parser should return a Tree class object. By using 
> SAX parser we avoid the problem of having whole XML file in memory, 

I agree.
An alternative would be to return an object of a wrapper class that mimics
Bio::Tree's API. However, such a class may be too hard to implement,
and data type conversion from/to Bio::Tree would still be needed even in
that case. So I think returning a Bio::Tree object is good.

> I am a little confused about the require statements in BioRuby classes. It
> looks like bio/tree.rb should hold a general class, but it requires
> bio/db/newick.rb, but this file in turn requires bio/tree.rb.

The only reason why bio/tree.rb requires bio/db/newick.rb is
to support Newick and NHX output of the tree. The code will
be refactored in the future.
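
For reference, a circular require like this is harmless in Ruby as long as
each file only needs what the other has already defined at that point. The
sketch below is not the actual content of bio/tree.rb or bio/db/newick.rb;
the file names, the CircularDemo module and the methods are all made up.
It writes two temporary files that require each other and loads them.

```ruby
require 'tmpdir'

# Writes two throwaway files that require each other, loads them, and
# returns the Newick-like string produced by the combined class.
# All file, module and method names are hypothetical, not BioRuby's.
def circular_require_demo
  Dir.mktmpdir do |tmp|
    dir = File.realpath(tmp)

    File.write(File.join(dir, 'demo_tree.rb'), <<~RUBY)
      module CircularDemo
        class Tree
          def leaves
            %w[A B]
          end
        end
      end
      require_relative 'demo_newick'   # output support, loaded last
    RUBY

    File.write(File.join(dir, 'demo_newick.rb'), <<~RUBY)
      require_relative 'demo_tree'     # already being loaded: returns false
      module CircularDemo
        class Tree
          def to_newick
            "(" + leaves.join(",") + ");"
          end
        end
      end
    RUBY

    require File.join(dir, 'demo_tree')
    CircularDemo::Tree.new.to_newick
  end
end
```

Ruby notices that demo_tree.rb is still being loaded and does not load it
again, so the class simply ends up with both sets of methods. Ruby may
print a circular require warning when warnings are enabled.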

On Tue, 19 May 2009 14:54:18 -0700
Christian M Zmasek <czmasek at burnham.org> wrote:

> Hi, Diana:
> 
> I think it is a good idea to have the parser return one tree at a time, 
> as opposed to returning a list of trees.

I agree.
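
A minimal sketch of what "one tree at a time" could look like follows.
each_phylogeny is a hypothetical helper, not an existing BioRuby method.
It uses REXML's pull parser as a stand-in for LibXML::XML::Reader and
plain Hashes instead of Bio::Tree, and it only reads clade names.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: yields one tree per <phylogeny> element, as a
# nested Hash of the form { name: String or nil, children: [...] }.
def each_phylogeny(xml)
  return enum_for(:each_phylogeny, xml) unless block_given?

  parser = REXML::Parsers::PullParser.new(xml)
  stack = []       # clades that are currently open, innermost last
  root = nil       # top-level clade of the phylogeny being read
  in_name = false  # true while inside a <name> that belongs to a clade

  while parser.has_next?
    event = parser.pull
    case event.event_type
    when :start_element
      case event[0]
      when 'clade'
        clade = { name: nil, children: [] }
        if stack.empty?
          root = clade
        else
          stack.last[:children] << clade
        end
        stack.push(clade)
      when 'name'
        in_name = !stack.empty?
      end
    when :text
      stack.last[:name] = event[0].strip if in_name
    when :end_element
      case event[0]
      when 'clade' then stack.pop
      when 'name'  then in_name = false
      when 'phylogeny'
        yield root   # hand over the finished tree, then forget it
        root = nil
      end
    end
  end
end
```

With a block, only the tree currently being built is held by the helper.
Without a block it returns an Enumerator, so callers that really want a
list can still call to_a on it.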

> On the other hand, the same does not apply to nodes. I think it is 
> perfectly acceptable to expect to have enough memory to keep at least 
> one tree in memory (a good target size might be a binary tree with 
> ten-thousand external nodes and 200 bytes of annotation per node, which 
> according to my rough calculations would require less than 5MB).
> 
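
The estimate above can be checked with simple arithmetic, assuming a rooted,
strictly binary tree (which has one internal node fewer than external nodes).
annotation_bytes is a name made up for this check, and it counts only the
annotation payload, not the overhead of the objects that hold it.

```ruby
# Total annotation payload in bytes for a rooted, strictly binary tree.
# Assumption: such a tree with n external nodes has n - 1 internal nodes.
def annotation_bytes(external_nodes, bytes_per_node)
  internal_nodes = external_nodes - 1
  (external_nodes + internal_nodes) * bytes_per_node
end

puts annotation_bytes(10_000, 200)  # 19,999 nodes * 200 bytes = 3999800
```

That is about 4 MB of annotation, consistent with the "less than 5MB" figure.
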
> For your tree use cases, important ones to add are:
> * iteration over all nodes
> * retrieval/finding of specific nodes according to some criterion (e.g. 
> find all nodes for which the species is "E. coli")
> * tree reconciliation (e.g. compare a gene tree to a species tree, in 
> order to determine duplications on the gene tree)
> 
> In any case, all these applications/algorithms will be most time 
> efficient and easiest to implement with trees which are completely in 
> memory.

In addition, manipulation of trees (adding/deleting nodes and edges,
etc.) is easy to implement when the whole tree is in memory.
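
As a small illustration of these use cases on an in-memory tree, here is a
sketch with a made-up DemoNode class. It is not Bio::Tree, whose API is
different; it only shows why iteration, searching and editing become short
once every node is reachable in memory.

```ruby
# Hypothetical in-memory tree node; Bio::Tree's real API is different.
class DemoNode
  include Enumerable
  attr_reader :name, :species, :children

  def initialize(name, species = nil)
    @name = name
    @species = species
    @children = []
  end

  # Attaches node below this one and returns it.
  def add_child(node)
    @children << node
    node
  end

  # Detaches node (and with it its whole subtree) from this node.
  def delete_child(node)
    @children.delete(node)
  end

  # Depth-first, pre-order traversal of this node and all descendants.
  def each(&block)
    return enum_for(:each) unless block

    yield self
    @children.each { |child| child.each(&block) }
  end
end

root = DemoNode.new('root')
root.add_child(DemoNode.new('gene1', 'E. coli'))
inner = root.add_child(DemoNode.new('gene2', 'B. subtilis'))
leaf = inner.add_child(DemoNode.new('gene3', 'E. coli'))

# Iteration over all nodes.
puts root.map(&:name).join(' ')

# Finding nodes by a criterion.
puts root.select { |n| n.species == 'E. coli' }.map(&:name).join(' ')

# Deleting a node.
inner.delete_child(leaf)
puts root.count
```

Because DemoNode includes Enumerable, methods such as map, select and count
come for free from the single each method.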

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org