[BioRuby] RFC Caching (was BioRuby standards)

Thu Sep 25 10:58:17 EDT 2008

Hi,

On Wed, 24 Sep 2008 18:29:24 +0200
pjotr2008 at thebird.nl (Pjotr Prins) wrote:

> Hi Naohisa,
> 
> On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote:
> > Hi Pjotr,
> > 
> > I've seen files in your lib/bio/db/microarray, and I suppose
> > it's still under development and it will be changed frequently,
> > and I think it's not a time to include them in main bioruby.
> > So, my comments below are mainly for future improvements.
> 
> What there is is 'stable'. Certainly the NCBI stuff is rather complete. The
> biolib libraries could go in later. It is up to you, but I think it would be
> nice to have mainstream microarray support before one of the other Bio*
> libraries (and biolib support is there for all). We don't want to be beaten by
> BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem
> available - but that may be confusing for others.

I agree it is good to have microarray support, if it is useful.
Could you please show short examples and use cases of the
microarray support?

> Another thing, what is the point of open source software if no one tests it.
> How about regularly releasing a testing version of bioruby? We see some more
> activity in BioRuby - which is a good thing. You can't expect things to be
> ready from the word GO!

I think new version should be released soon, but
currently, there is no release management.

> Meanwhile, I do appreciate your comments. It is forcing me to write better
> code. Teaching an old fox new tricks ;-)
> 
> > 1. about cache.rb
> > 
> > The "safe = true" argument in 'set' and 'directory' seems
> > bad idea. I think there is no need to give insecure options
> > to users.
> 
> I'll remove it if you wish. I think it is up to the implementor - if you have a
> web service you better use the default safe mode. Otherwise, who cares. I, for
> one, would like to use /tmp in some cases.

I wish it is to be removed. Recently, temporary file vulnerability
in software not directly related to server services have also been
treated as security issue, e.g. f2c (fortran to C converter)
http://www.debian.org/security/2005/dsa-661
So, it's good not to give a chance of insecure operation.

> > In 'directory' method, 
> > >  cache = Dir.mktmpdir(subdir)
> > 
> > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7,
> > and not available in 1.8.6 and older versions.
> > Because most users are still using Ruby 1.8.5 and 1.8.6,
> > to avoid using Dir.mktmpdir is currently a choice.
> > Alternatively, write a document that the feature can work
> > only in Ruby 1.8.7 or later.
> 
> Yes we can document that. Using microarray bindings a later Ruby is a
> good idea anyway.

OK.

Question: Does the microarray support work on Ruby 1.9?
Most part of bioruby still do not support Ruby 1.9,
though some code can run on Ruby 1.9.

> > Note that current requirement of BioRuby is
> > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)".
> > Also note that FileUtils.remove_entry_secure was introduced
> > in Ruby 1.8.3.
> 
> Well, the modules are optionally included. It shouldn't break if
> people don't use the microarray stuff. This is true for the dependency
> on external biolib too.

OK.

> > Finally, I'm wondering if the Cache class can still be
> > a singleton or not in the future. Currently, only NCBI_GEO
> > is using the cache, but if it were used from many classes
> > with different data formats, files with different formats
> > would be existed in the same cache directory, and file name
> > conflicts might be happened.
> 
> This implementation is such that we create a shared dir, with classes using
> different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between
> modules. My current GEO cache is 30 Mb. If I were to download that every time
> my research would be severely hampered. I think it is very useful and could
> also be for running webservices of other modules. You don't want web servers
> to retain everything in memory.

In the current implementation, the singleton object stores
@subdir, and it is the same as a global variable.
For example, If a user want to get both GEO and ArrayExpress
(hopefully supported in the future), and I wrote a code
like this:

   Bio::Microarray::Cache.set('/home/who/.bioruby-cache')
   obj1 = Bio::Microarray::GEO::GSE.new('GSE1')
   obj2 = Bio::Microarray::ArrayExpress.new('Acc2')
   obj3 = Bio::Microarray::GEO::GSE.new('GSE3')
   obj4 = Bio::Microarray::ArrayExpress.new('Acc4')

In this case, how to specify sub directory?
Or, am I misunderstanding @subdir?

BTW, FYI, there is memcached, on-memory cache for web server.
http://www.danga.com/memcached/

> > 2. About file locations
> > 
> > Below are recommended to be moved to bio/io/,
> > because their main purpose is file or network I/O,
> > and not data parsing.
> >     bio/db/microarray/cache.rb 
> 
> OK.
> 
> >     Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb
> 
> It does NCBI XML parsing - but that is not what you mean?

I meant only XML.create, XML.fetch, and XML.parsexml methods.
But, because they are short, I think again that no need
to move them.

For microarray data, or for large-scale data, because of
efficiency, I can understand that close relationship between
I/O and data format class is needed. However, from the
viewpoint to treat various data from various databases,
separating I/O and data parsing is better, maybe in the future.

> > The class/module names are not needed to be changed.
> > 
> > The files with external dependency to the "biolib" might
> > also be suggested to be moved from bio/db to the other
> > location, but no best location found.
> 
> heh - anyone else a suggestiong? The biolib stuff does do microarray loading
> and will do normalization and analysis soon.
> 
> > 3. BIo::Microarray::NCBI_GEO
> > 
> > In bio/db/microarray/ncbi_geo/geo.rb,
> > 
> > >    include REXML
> > 
> > If the aim to include REXML module is only to skip the
> > REXML:: prefix, I don't like to include it in library,
> > because the constants and methods defined in REXML are
> > mixed and they might cause bad side effects.
> > (Note that unlike in a library, it is free to include
> > anything in an application.)
> 
> OK
> 
> > > def XML::create(acc)
> > 
> > In my impression, the method name "XML.create" might be
> > reserved to be used by a method to create XML data structure
> > from scratch or from some data.
> 
> > To define a class method, I like 'def self.create(acc)'
> > because it is easy to change class (module) name.
> 
> It is a class factory. I'll have a think.

I suggest Bio::Microarray::GEO::XML.new(acc).

> > > def XML::fetch(xmlfn, acc)
> > >    url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"
> > 
> > URI escaping is needed, e.g. acc=#{URI.escape(acc)}
> > 
> > >    print "Fetching ",url,"\n" if $VERBOSE
> > >    r = Net::HTTP.get_response( URI.parse( url ) )
> > 
> > To support proxy, use Bio::Command.get_uri(url).
> 
> OK and OK
> 
> > > def XML::valid_accession?(acc = nil)
> > >   acc = @acc if not acc
> > >   acc =~ /^(GSM|GSE|GPL)\d+$/
> > 
> > If "GSM0123\nGSM4567" is invalid, the regular expression
> > should be /\A(GSM|GSE|GPL)\d+\z/ .
> 
> good point.
> 
> > > def XML::parsexml(acc)
> > 
> > Is there no way to get input XML data as String?
> 
> Sigh. Sure there is. Of from a file. An IO object would be cool. 
> Maybe the next version.
> 
> > >   if XML::valid_accession? acc
> > >     cache = Cache.instance.directory
> > >     fn = cache+'/'+acc+'.xml'
> > 
> > Please use File.join.
> 
> Sorry. OK.
> 
> Pj.
> 

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org