[BioRuby] RFC Caching (was BioRuby standards)

Wed Sep 24 12:29:24 EDT 2008

Hi Naohisa,

On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote:
> Hi Pjotr,
> 
> I've seen files in your lib/bio/db/microarray, and I suppose
> it's still under development and it will be changed frequently,
> and I think it's not a time to include them in main bioruby.
> So, my comments below are mainly for future improvements.

What there is is 'stable'. Certainly the NCBI stuff is rather complete. The
biolib libraries could go in later. It is up to you, but I think it would be
nice to have mainstream microarray support before one of the other Bio*
libraries (and biolib support is there for all). We don't want to be beaten by
BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem
available - but that may be confusing for others.

Another thing, what is the point of open source software if no one tests it.
How about regularly releasing a testing version of bioruby? We see some more
activity in BioRuby - which is a good thing. You can't expect things to be
ready from the word GO!

Meanwhile, I do appreciate your comments. It is forcing me to write better
code. Teaching an old fox new tricks ;-)

> 1. about cache.rb
> 
> The "safe = true" argument in 'set' and 'directory' seems
> bad idea. I think there is no need to give insecure options
> to users.

I'll remove it if you wish. I think it is up to the implementor - if you have a
web service you better use the default safe mode. Otherwise, who cares. I, for
one, would like to use /tmp in some cases.

> In 'directory' method, 
> >  cache = Dir.mktmpdir(subdir)
> 
> The Dir.mktmpdir method is a new feature added in Ruby 1.8.7,
> and not available in 1.8.6 and older versions.
> Because most users are still using Ruby 1.8.5 and 1.8.6,
> to avoid using Dir.mktmpdir is currently a choice.
> Alternatively, write a document that the feature can work
> only in Ruby 1.8.7 or later.

Yes we can document that. Using microarray bindings a later Ruby is a
good idea anyway.

> Note that current requirement of BioRuby is
> "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)".
> Also note that FileUtils.remove_entry_secure was introduced
> in Ruby 1.8.3.

Well, the modules are optionally included. It shouldn't break if
people don't use the microarray stuff. This is true for the dependency
on external biolib too.

> Finally, I'm wondering if the Cache class can still be
> a singleton or not in the future. Currently, only NCBI_GEO
> is using the cache, but if it were used from many classes
> with different data formats, files with different formats
> would be existed in the same cache directory, and file name
> conflicts might be happened.

This implementation is such that we create a shared dir, with classes using
different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between
modules. My current GEO cache is 30 Mb. If I were to download that every time
my research would be severely hampered. I think it is very useful and could
also be for running webservices of other modules. You don't want web servers
to retain everything in memory.

> 2. About file locations
> 
> Below are recommended to be moved to bio/io/,
> because their main purpose is file or network I/O,
> and not data parsing.
>     bio/db/microarray/cache.rb 

OK.

>     Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb

It does NCBI XML parsing - but that is not what you mean?

> The class/module names are not needed to be changed.
> 
> The files with external dependency to the "biolib" might
> also be suggested to be moved from bio/db to the other
> location, but no best location found.

heh - anyone else a suggestiong? The biolib stuff does do microarray loading
and will do normalization and analysis soon.

> 3. BIo::Microarray::NCBI_GEO
> 
> In bio/db/microarray/ncbi_geo/geo.rb,
> 
> >    include REXML
> 
> If the aim to include REXML module is only to skip the
> REXML:: prefix, I don't like to include it in library,
> because the constants and methods defined in REXML are
> mixed and they might cause bad side effects.
> (Note that unlike in a library, it is free to include
> anything in an application.)

OK

> > def XML::create(acc)
> 
> In my impression, the method name "XML.create" might be
> reserved to be used by a method to create XML data structure
> from scratch or from some data.

> To define a class method, I like 'def self.create(acc)'
> because it is easy to change class (module) name.

It is a class factory. I'll have a think.

> > def XML::fetch(xmlfn, acc)
> >    url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"
> 
> URI escaping is needed, e.g. acc=#{URI.escape(acc)}
> 
> >    print "Fetching ",url,"\n" if $VERBOSE
> >    r = Net::HTTP.get_response( URI.parse( url ) )
> 
> To support proxy, use Bio::Command.get_uri(url).

OK and OK

> > def XML::valid_accession?(acc = nil)
> >   acc = @acc if not acc
> >   acc =~ /^(GSM|GSE|GPL)\d+$/
> 
> If "GSM0123\nGSM4567" is invalid, the regular expression
> should be /\A(GSM|GSE|GPL)\d+\z/ .

good point.

> > def XML::parsexml(acc)
> 
> Is there no way to get input XML data as String?

Sigh. Sure there is. Of from a file. An IO object would be cool. 
Maybe the next version.

> >   if XML::valid_accession? acc
> >     cache = Cache.instance.directory
> >     fn = cache+'/'+acc+'.xml'
> 
> Please use File.join.

Sorry. OK.

Pj.