[BioRuby] RFC Caching (was BioRuby standards)
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Thu Sep 25 14:58:17 UTC 2008
Hi,
On Wed, 24 Sep 2008 18:29:24 +0200
pjotr2008 at thebird.nl (Pjotr Prins) wrote:
> Hi Naohisa,
>
> On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote:
> > Hi Pjotr,
> >
> > I've seen files in your lib/bio/db/microarray, and I suppose
> > it's still under development and it will be changed frequently,
> > and I think it's not a time to include them in main bioruby.
> > So, my comments below are mainly for future improvements.
>
> What there is is 'stable'. Certainly the NCBI stuff is rather complete. The
> biolib libraries could go in later. It is up to you, but I think it would be
> nice to have mainstream microarray support before one of the other Bio*
> libraries (and biolib support is there for all). We don't want to be beaten by
> BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem
> available - but that may be confusing for others.
I agree it is good to have microarray support, if it is useful.
Could you please show short examples and use cases of the
microarray support?
> Another thing, what is the point of open source software if no one tests it.
> How about regularly releasing a testing version of bioruby? We see some more
> activity in BioRuby - which is a good thing. You can't expect things to be
> ready from the word GO!
I think a new version should be released soon, but
currently there is no release management.
> Meanwhile, I do appreciate your comments. It is forcing me to write better
> code. Teaching an old fox new tricks ;-)
>
> > 1. about cache.rb
> >
> > The "safe = true" argument in 'set' and 'directory' seems
> > a bad idea. I think there is no need to give insecure options
> > to users.
>
> I'll remove it if you wish. I think it is up to the implementor - if you have a
> web service you better use the default safe mode. Otherwise, who cares. I, for
> one, would like to use /tmp in some cases.
I would like it to be removed. Recently, temporary file vulnerabilities
in software not directly related to server services have also been
treated as security issues, e.g. f2c (Fortran to C converter):
http://www.debian.org/security/2005/dsa-661
So, it is better not to give any chance of insecure operation.
> > In 'directory' method,
> > > cache = Dir.mktmpdir(subdir)
> >
> > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7,
> > and not available in 1.8.6 and older versions.
> > Because most users are still using Ruby 1.8.5 and 1.8.6,
> > avoiding Dir.mktmpdir is currently one option.
> > Alternatively, write a document that the feature can work
> > only in Ruby 1.8.7 or later.
>
> Yes we can document that. When using microarray bindings, a later Ruby is a
> good idea anyway.
OK.
Question: Does the microarray support work on Ruby 1.9?
Most parts of BioRuby still do not support Ruby 1.9,
though some code can run on Ruby 1.9.
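As a rough illustration of the Dir.mktmpdir availability issue discussed above, here is a minimal sketch of a helper that uses Dir.mktmpdir when the running Ruby provides it and otherwise falls back to creating a private directory by hand. The name make_private_tmpdir and the fallback naming scheme are assumptions for illustration only and are not part of cache.rb.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of BioRuby): create a private temporary
# directory whether or not Dir.mktmpdir is available.
def make_private_tmpdir(prefix)
  # Dir.mktmpdir was added in Ruby 1.8.7; use it when present.
  return Dir.mktmpdir(prefix) if Dir.respond_to?(:mktmpdir)

  # Fallback for older Ruby: Dir.mkdir raises Errno::EEXIST if the path
  # already exists, so an existing directory is never reused.
  10.times do
    name = "#{prefix}#{Process.pid}-#{rand(0x100000000).to_s(36)}"
    path = File.join(Dir.tmpdir, name)
    begin
      Dir.mkdir(path, 0700)   # readable and writable by the owner only
      return path
    rescue Errno::EEXIST
      next                    # name collision, try another random suffix
    end
  end
  raise "cannot create temporary directory under #{Dir.tmpdir}"
end
```

The caller is responsible for removing the directory afterwards, for example with FileUtils.remove_entry_secure.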
> > Note that current requirement of BioRuby is
> > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)".
> > Also note that FileUtils.remove_entry_secure was introduced
> > in Ruby 1.8.3.
>
> Well, the modules are optionally included. It shouldn't break if
> people don't use the microarray stuff. This is true for the dependency
> on external biolib too.
OK.
> > Finally, I'm wondering if the Cache class can still be
> > a singleton or not in the future. Currently, only NCBI_GEO
> > is using the cache, but if it were used from many classes
> > with different data formats, files with different formats
> > would exist in the same cache directory, and file name
> > conflicts might happen.
>
> This implementation is such that we create a shared dir, with classes using
> different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between
> modules. My current GEO cache is 30 Mb. If I were to download that every time
> my research would be severely hampered. I think it is very useful and could
> also be for running webservices of other modules. You don't want web servers
> to retain everything in memory.
In the current implementation, the singleton object stores
@subdir, which makes it equivalent to a global variable.
For example, suppose a user wants to get data from both GEO and
ArrayExpress (hopefully supported in the future), and writes
code like this:
  Bio::Microarray::Cache.set('/home/who/.bioruby-cache')
  obj1 = Bio::Microarray::GEO::GSE.new('GSE1')
  obj2 = Bio::Microarray::ArrayExpress.new('Acc2')
  obj3 = Bio::Microarray::GEO::GSE.new('GSE3')
  obj4 = Bio::Microarray::ArrayExpress.new('Acc4')
In this case, how is the subdirectory specified?
Or am I misunderstanding @subdir?
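One possible way around a single shared @subdir is to have each caller name its own subdirectory on every lookup. The following is only a sketch under that assumption: SketchCache, dir_for and path_for are hypothetical names and do not reflect the actual Bio::Microarray::Cache API.

```ruby
require 'fileutils'

# Hypothetical sketch, not the real Bio::Microarray::Cache: the cache
# keeps only the root directory, and each data source passes its own
# subdirectory name, so no per-source state is stored globally.
class SketchCache
  def initialize(root)
    @root = root
  end

  # Return the subdirectory for one data source, creating it if needed.
  # Only simple names are accepted, so a caller cannot escape the root.
  def dir_for(source)
    unless source =~ /\A[A-Za-z0-9_]+\z/
      raise ArgumentError, "invalid source name: #{source.inspect}"
    end
    path = File.join(@root, source)
    FileUtils.mkdir_p(path)
    path
  end

  # Return the full path of a cached file within a source's subdirectory.
  # File.basename drops any directory part of the given file name.
  def path_for(source, filename)
    File.join(dir_for(source), File.basename(filename))
  end
end

# Example (paths are illustrative):
#   cache = SketchCache.new('/home/who/.bioruby-cache')
#   cache.path_for('GEO', 'GSE1.xml')
#   cache.path_for('ArrayExpress', 'Acc2.xml')
```

With this shape, GEO and ArrayExpress objects can be created in any order without one changing the directory seen by the other.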
BTW, FYI, there is memcached, an in-memory cache for web servers.
http://www.danga.com/memcached/
> > 2. About file locations
> >
> > Below are recommended to be moved to bio/io/,
> > because their main purpose is file or network I/O,
> > and not data parsing.
> > bio/db/microarray/cache.rb
>
> OK.
>
> > Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb
>
> It does NCBI XML parsing - but that is not what you mean?
I meant only the XML.create, XML.fetch, and XML.parsexml methods.
But because they are short, on second thought there is no need
to move them.
For microarray data, or for large-scale data, I can understand
that a close relationship between I/O and the data format class
is needed for efficiency. However, from the viewpoint of handling
various data from various databases, separating I/O and data
parsing would be better, maybe in the future.
> > The class/module names do not need to be changed.
> >
> > The files with external dependency to the "biolib" might
> > also be suggested to be moved from bio/db to the other
> > location, but no best location found.
>
> heh - anyone else have a suggestion? The biolib stuff does do microarray loading
> and will do normalization and analysis soon.
>
> > 3. Bio::Microarray::NCBI_GEO
> >
> > In bio/db/microarray/ncbi_geo/geo.rb,
> >
> > > include REXML
> >
> > If the aim to include REXML module is only to skip the
> > REXML:: prefix, I don't like to include it in library,
> > because the constants and methods defined in REXML are
> > mixed and they might cause bad side effects.
> > (Note that unlike in a library, it is free to include
> > anything in an application.)
>
> OK
>
> > > def XML::create(acc)
> >
> > In my impression, the method name "XML.create" might be
> > reserved to be used by a method to create XML data structure
> > from scratch or from some data.
>
> > To define a class method, I like 'def self.create(acc)'
> > because it is easy to change class (module) name.
>
> It is a class factory. I'll have a think.
I suggest Bio::Microarray::GEO::XML.new(acc).
> > > def XML::fetch(xmlfn, acc)
> > > url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"
> >
> > URI escaping is needed, e.g. acc=#{URI.escape(acc)}
> >
> > > print "Fetching ",url,"\n" if $VERBOSE
> > > r = Net::HTTP.get_response( URI.parse( url ) )
> >
> > To support proxy, use Bio::Command.get_uri(url).
>
> OK and OK
>
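To make the escaping point concrete, here is a minimal sketch that builds the query string with URI.encode_www_form from the standard library instead of interpolating acc directly. The helper name geo_query_url is hypothetical. Note also that URI.escape, as suggested in the quoted text, was removed in Ruby 3.0, which is why this sketch uses a different call.

```ruby
require 'uri'

# Hypothetical helper: build the GEO query URL with every parameter
# value escaped, so characters such as "&" or "=" in acc cannot alter
# the query string.
def geo_query_url(acc)
  query = URI.encode_www_form(
    'acc'     => acc,
    'form'    => 'xml',
    'view'    => 'brief',
    'retmode' => 'xml'
  )
  "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?#{query}"
end
```

For an ordinary accession such as 'GSE1' the result is the same URL as in the quoted code; only unusual characters are changed.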
> > > def XML::valid_accession?(acc = nil)
> > > acc = @acc if not acc
> > > acc =~ /^(GSM|GSE|GPL)\d+$/
> >
> > If "GSM0123\nGSM4567" is invalid, the regular expression
> > should be /\A(GSM|GSE|GPL)\d+\z/ .
>
> good point.
>
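The difference between the two patterns can be checked directly. In Ruby, ^ and $ match at the start and end of any line inside the string, while \A and \z match only at the start and end of the whole string. The sketch below uses a standalone valid_accession? function and constant names chosen for illustration; it is not the method from geo.rb.

```ruby
# Line-anchored pattern from the quoted code, and the string-anchored
# pattern suggested in the review.
LINE_ANCHORED   = /^(GSM|GSE|GPL)\d+$/
STRING_ANCHORED = /\A(GSM|GSE|GPL)\d+\z/

# Illustrative standalone check using the string-anchored pattern.
def valid_accession?(acc)
  !(STRING_ANCHORED =~ acc.to_s).nil?
end

multi_line = "GSM0123\nGSM4567"

# ^...$ accepts the input because its first line alone matches.
line_result = !(LINE_ANCHORED =~ multi_line).nil?

# \A...\z rejects it because the whole string must match.
string_result = valid_accession?(multi_line)
```

Here line_result is true and string_result is false. A trailing newline, as in "GSM0123\n", is also rejected by \z.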
> > > def XML::parsexml(acc)
> >
> > Is there no way to get input XML data as String?
>
> Sigh. Sure there is. Or from a file. An IO object would be cool.
> Maybe the next version.
>
> > > if XML::valid_accession? acc
> > > cache = Cache.instance.directory
> > > fn = cache+'/'+acc+'.xml'
> >
> > Please use File.join.
>
> Sorry. OK.
>
> Pj.
>
Thanks,
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org