[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Sat Jul 3 10:25:41 EDT 2010

On Sat, Jul 03, 2010 at 02:37:45AM +0530, Anurag Priyam wrote:
> >
> > I don't understand what you want to do, and how the const_get shorten
> > your code. Please show example code.
> >
> >
> Taking matrix for example, in each type of Matrix I have defined an add
> add_row method which accepts a row which would be of the same kind(
> DnaSeqMatrix will take DnaSeqRow ) :

so, why do you need to test for the type? see below.

>       def add_row( row )
>         raise InvalidRowException, "DnaSeqRow expected." unless
> row.instance_of? DnaSeqRow
>         row_set[ row.id ] = row
>       end
> 
> If instead I define a add_row method in SeqMatrix like this:
> 
> def add_row( row )
>   # a DnaSeqMatrix will take a DnaSeqRow
>   type = self.class.to_s.sub( /Matrix/, 'Row' )
>   klass = NeXML.const_get( type )
>   raise InvalidRowException, "#{type} expected." unless row.instance_of?
> klass
> end
> 
> This way I won't have to define add_row for each sub type of SeqRow.
> Similarly for others.

This would be a factory, right?

I think what you want to do is good in its objective - trying to
shorten the implementation. But do you really need
class inspection/reflection here? Asking for the class name is usually
prevented by having proper attributes in the classes. That is if you
use OOP. Question is whether you really require this.

The problem with the code I had (and have) was the really wide and
deep use of OOP classes. That led to duplication of code and little
'feel' for correctness of what is in there. Deep OOP hierarchies are
evil. Duplication is ugly.

Inspection/reflection is evil too - like Naohisa reacted, pretty much
- it is only used in exceptional cases when there is no other elegant
way of resolving issues. It can be powerful, but only use when really
required, as other people often fail to understand what it does - and
code should be self-documenting.

I think you need to ask more fundamental questions to yourself.

Why not use BioRuby basic types for most data represented by NeXML?
Only use special objects when there is real added value. So DnaSeqRow
would simply be a Sequence (or even list of char) and DnaSeqMatrix
would be a list of Sequence. If you have further attributes create a
new composite object (like SequenceFeatures, or if you think more
functionally again a tuple of sequence(s) and features?).  This way
you don't create a hierarchy that booms into hundreds of specialized
object we won't use elsewhere. To differentiate between a DnaSequence
and RnaSequence you do not need different objects. Both are strings
(in BioRuby). You could even settle for Ruby's primitive types and
containers.

Likewise, even if you need a Matrix, you don't need RnaMatrix and
DnaMatrix. I am sure of that. They are only specializations in name,
the code in there should be identical.

If you go down the OOP route, make use of Ruby's mixin's. Search
Google for "ruby mixin deep oop hierarchy".

My recommendation is to refactor the library to use as primitive a
type as possible, at every point. When you run into functionality that
requires a more complex type, because there is no other way - that is
the moment to design and add it.

I don't know the full depth of the NeXML format, but I can predict
it consists of primitive types in ordered ways. This can be mirrored
by the implementation. If you do it like this you won't have to use
inspection (like above question). OOP classes are for harnessing
special functionality that go with a certain type. Do not create a type
unless you need something special. 

You can propose changes to existing BioRuby types - in particular
with the RDF implementation.

I know some people will balk at this rewrite - but to be honest, if
you want your library to be useful to others it needs rethinking. I
would take a week out of your plan to experiment with different object
models - just start with a small subset. When you think something
works, roll it out all the way. That can be done quickly. Read, read,
read on the Internet about object models.

One thing you can consider is to use an intermediate object structure
for parsing the XML into Ruby - and next fork it out into logical
data structures. I do that regularly as the XML 'model' does not
normally map to Ruby well. One example of mine is here

  http://github.com/pjotrp/swig2doc/blob/master/lib/input/doxyxmlparser.rb

Doxy objects are stored in

  http://github.com/pjotrp/swig2doc/blob/master/lib/cobj/doxy/doxycobjs.rb

Note swig2doc also contains a convenience class for using libxml2 in

  http://github.com/pjotrp/swig2doc/blob/master/lib/input/xmleasyreader.rb

And while you are at refactoring, why not make sure the parser does
not fill memory.

Pj.

PS. Are you using another NeXML OOP implementation as a model - Perl,
Python, Java? I would like to know, so I can have a look.