[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Fri Jun 25 02:23:34 EDT 2010

> Can someone who has worked with NeXML (Rutger) have a look at the
> interface description on:
>
>  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> It looks natural to query this way, if you know what the NeXML files
> contains (e.g. trees, or sequences). What would be the natural
> approach if you do *not* know the contents? I.e. how does one iterate
> over the NeXML object?
>
>
NeXML has three primary elements: otus, trees, characters. All three of them
are container for other elements: otu, tree, network, matrix. Currently the
each method of an nexml object iterates over each tree object. I did this
thinking that tree is the most important part of a phylogenetic analysis(
and also because I had not implemented characters then). What were you
thinking here? Should each iterate over all otu, tree and matrix or the
primary otus, trees and characters elements? I would go for the later.

> Anurag, your web page states you implemented a LibXML::Parser, and you
> named it Parser. Meanwhile, it looks like you have implemented libxml2
> streaming, using a Reader. This is a bit confusing. I presume you are
> using the technique used in Diana's PhyloXML parser. You are requiring
> the 'xml' package. Is that libxml2 these days, or is it actually
> 'libxml'? Does it work for all Ruby versions? libxml is an external
> (binary) dependency, so it may not exist and fail.  PhyloXML does not
> handle failure either.
>
>
I am glad you asked this. I wanted to discuss it here.

I have used libxml2 streaming api, without actually streaming the document
to the user. The cursor does not move through the document when you iterate
over elements( phyloxml does that ). I am parsing the document at one go; at
the start, and storing the objects in memory. Should we want to switch to
streaming, using libxml's streaming API from start should make it easier.

Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
am myself working with 1.8.7. I will have to test the compatibility with
ruby 1.9.

> The other high-level questions concern testing. For others, the unit
> tests are here:
>
>  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/
>
> I notice you have limited test input data. How can you be really sure
> your code works for all cases? How can you be really sure that future
> changes to the code don't break?

Right. I am working on improving the test suites taking lessons from the
other bioruby test suites.

> And how are you going to measure
> performance of your code?
>
>
Actually I have not done anything here. I will benchmark and profile the
code and discuss the results here.

Finally, getting down to some code. Most of the code is in a single
> file:
>
>  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb
>
> or
>
>
> http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb
>
> I think it should be broken up. It would be logical to split by type
> of elements - at least. I know in BioRuby we are ambiguous about file
> sizes - I think a single file should describe one concept. That way
> file names become self describing. Files larger than 300 lines tend
> to be hard to digest - and probably point out some bigger issue.
>
>
Agreed.

> Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
> SeqRow (line 2148 and onwards in element.rb), I can see duplicated
> coding 'patterns'. You are repeating a concept. Would there not be a
> more elegant way in Ruby to handle this? Hint: Inheritance is just one
> mechanism, I see no real reason to use an inheritance tree. Why not
> use one Sequence class for all of these which can contain different
> formed elements? I bet the code would become a lot shorter and
> (probably) less error prone. Take Ruby's Array container class as an
> example - it is just one implementation of a container which allows
> many types of elements.
>

The idea here was to implement a type system and stick close to the class
hierarchy followed in the schema. However, looking back, I myself do not
find the code for the Matrix class very elegant.

A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
>  http://github.com/tablatom/rubydoctest
>
>
Hey, I did not know that doctests existed for Ruby too. I will have a look
into it.

> I used these in bio/appl/paml/codeml/report.rb - these are examples
> that double as tests. Kill two birds with one stone! The BioRuby
> tutorial also uses doctests - i.e. the code in the Tutorial can be
> validated against the installed bioruby. If you want to use this you
> need an extra conversion - I have that tool.
>

I will check out the examples. What tool? I would like to know more.

>
> Another possibility is to start using RSpec.
>
>  http://rspec.info/
>
> I really like RSpec too - it is more of a replacement for unit
> tests - and easier to understand, so Specs double as documentation.
>
>
I am missing Rspec too from my Rails and Merb days. I picked up unit tests
because much of the framework had used the same and also because I wanted to
try it out :).

> I am interested to see what you want to do for RDF support. Maybe you
> can write out the API as an RSpec? That would be a good start.
>
>
That sounds like a nice idea.

> Do not hesitate to stand up to me. You will probably get support from
> someone on this list ;)
>
> Pj.
>
> On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> > Last week, I worked on finishing implementation of Trees: trees, tree,
> > network; and started work on the characters element. This weeks target is
> to
> > complete the implementation of the characters element.
> >
> > It would be awesome to have some code review including: implementation,
> API
> > design, coding style and tests. I am planning to give a good amount of
> time
> > in the fourth week in making the code more robust. It would make perfect
> > sense to have some feedback to serve as guidelines :). The master branch
> and
> > API discussion page are at:
> >
> > [1] http://github.com/yeban/bioruby
> > [2]
> >
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> >
> > --
> > Anurag Priyam,
> > 2nd Year Undergraduate,
> > Department of Mechanical Engineering,
> > IIT Kharagpur.
> > +91-9775550642
> > _______________________________________________
> > BioRuby Project - http://www.bioruby.org/
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
>

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642