From ngoto at gen-info.osaka-u.ac.jp Mon Jan 4 02:15:18 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 4 Jan 2010 16:15:18 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20091231141546.GA5770@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
Message-ID: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I also think the current Bio::PAML::Codeml::Report needs to be
rewritten. It would be great if you did so. Here are my comments.

> codeml = Bio::PAML::Codeml.new(nil, :runmode => 0, :RateAncestor => 1,
>                                :alpha => 0.5, :fix_alpha => 0)
> report = codeml.query(alignment, tree)
>
> which, as it happens, works. The 'nil' points to the program executable.
> 'nil' merely fills in 'codeml'. It would have been better to make it one
> of the listed options, e.g. :binary => 'codeml'. That would save the ugly
> 'nil' parameter and belongs more to the principle of least surprise, that
> makes Ruby shine.

It is safe not to merge bioruby internal options and PAML's options.
If the upstream authors of PAML introduced a new option named binary,
severe problem would occur.

One way is to write code that acts something like C++ polymorphism.
For example, the code below accepts the three cases.

  * Bio::PAML::Codeml.new("/path/to/codeml")
  * Bio::PAML::Codeml.new({ :xxx => yyy, :ppp => qqq })
  * Bio::PAML::Codeml.new("/path/to/codeml", { :xxx => yyy, :ppp => qqq })

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0, 1
      begin
        params = argv[0].to_hash
      rescue NoMethodError
        program = argv[0]
      end
    when 2
      program, params = *argv
    else
      raise ArgumentError,
            "wrong number of arguments (#{argv.size} for 2)"
    end
    # continues to the current code...

The bad points are:

  * Complexity of code is increased.
  * It might make it difficult to refactor the code, especially when
    keyword arguments are introduced in a future version of Ruby.

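The argument dispatch described above can be tried in isolation. The following is only a minimal sketch under stated assumptions: the helper name split_program_and_params is hypothetical and not part of BioRuby, and it checks respond_to?(:to_hash) instead of rescuing NoMethodError, which is a swapped-in variation on the code in the message rather than the proposed implementation.

```ruby
# Hypothetical helper (not part of BioRuby): split constructor-style
# arguments into a program path and a parameter Hash.
def split_program_and_params(*argv)
  program = nil
  params = {}
  case argv.size
  when 0
    # nothing given: keep the defaults
  when 1
    if argv[0].respond_to?(:to_hash)
      # a Hash-like single argument is treated as the parameters
      params = argv[0].to_hash
    else
      # anything else is treated as the program path
      program = argv[0]
    end
  when 2
    program, params = argv
  else
    raise ArgumentError,
          "wrong number of arguments (#{argv.size} for 0..2)"
  end
  [program, params]
end

# The three call styles listed in the message:
p split_program_and_params("/path/to/codeml")
p split_program_and_params({ :runmode => 0 })
p split_program_and_params("/path/to/codeml", { :runmode => 0 })
```

The trade-off named in the message still applies: the extra branching exists only to let callers omit either argument.
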
Note that Ruby's author Matz has said that he had not applied the
principle of least surprise to the design of Ruby.
(http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
Please be careful that the word "principle of least surprise (POLS)"
is NG word when you request something in Ruby.
(http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

> A new implementation of Bio::PAML::Codeml::Report
> So I propose to rewrite the class supporting for multiple models,
> with the following usage (starting from a codeml report - really result):
>
> >> report.models.size
> => 2
> >> report.models[0].name
> => "M0"

I suppose report.models returns a Hash containing objects of newly written
class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
It seems good. Existing methods could be changed to return the first
model's values.

> Unit tests

Currently, tests with external dependencies (e.g. web services) are
located in the test/functional/ directory. So, your tests running codeml
would be named test/functional/bio/appl/paml/test_codeml.rb,
test/functional/bio/appl/paml/codeml/test_report.rb, or something like this.

> These tests, for example, can be run on a special switch:
>
>   runner.rb --test-dependencies

I'm now searching ways to pass such parameters to tests.
Note that tests can also be run in various ways. For example,

  ruby test/unit/bio/appl/paml/codeml/test_report.rb
  testrb test/unit/bio/appl/paml/codeml
  rake test

> I am sure it works, but doesn't anyone think this belongs in a support
> module (e.g. BioTestFile) for testing? What I would like to see is
> something less brittle:
>
>   require 'bio/test'
>   str = BioTestFile::read('paml/codeml/output.txt')

I'd like to keep tests simple and clear, and I think using standard
File.read is enough and clearer. When using such special class, to know
the behavior of the test code, reading extra file is needed.

> Personally, I dislike the naming/name space scheme of Bioruby.
> What to think of invoking a class named
>
>   report = Bio::PAML::Codeml::Report.new

Because there are many bioinformatics software and databases, names
tends to be longer, and nesting of namespace tends to be deeper.
I'd like to know naming rules and policies of other open-bio projects.

> Why can't it just be
>
>   include Bio
>   report = Codeml.new

I think it is enough to write "include Bio::PAML" instead of (or in
addition to) "include Bio".

>   include Bio
>   result = Paml.new(:program => 'codeml')

I don't like introducing such new parameter like :program.
I think 1 class 1 binary is better.

In addition, because the differences within PAML tools (codeml, baseml,
yn00, etc.) are currently not small, merging the classes is not so
realistic now.

On Thu, 31 Dec 2009 15:15:46 +0100
Pjotr Prins wrote:

> Hi Michael,
>
> I have a writeup on improving the current PAML functionality. Are you
> OK with this?
>
>   http://bioruby.open-bio.org/wiki/BIORUBY_PAML
>
> (maybe it does not belong on the bioruby Wiki - but I think of it
> like a 'design' document).
>
> Pj.
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl Mon Jan 4 04:03:18 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 10:03:18 +0100
Subject: [BioRuby] Bioruby design
Message-ID: <20100104090318.GA16136@thebird.nl>

Thanks for the reply Naohisa. As we are moving on to design, rather
than one implementation I am changing the thread.

On Mon, Jan 04, 2010 at 04:15:18PM +0900, Naohisa GOTO wrote:
> It is safe not to merge bioruby internal options and PAML's options.
> If the upstream authors of PAML introduced a new option named binary,
> severe problem would occur.

I am against breaking interfaces.
This is a minor design problem which should be
avoided in the future. And, yes, I would certainly not favour a
polymorphism solution, unless unavoidable. I don't think it is worth
'fixing' this interface aspect at this stage. Perhaps, there will be
opportunities later.

> Note that Ruby's author Matz has said that he had not applied the
> principle of least surprise to the design of Ruby.
> (http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
> Please be careful that the word "principle of least surprise (POLS)"
> is NG word when you request something in Ruby.
> (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

I did not know that, and personally I do not care. I think POLS is a
really good idea, though it should not automatically come at the
expense of (for example) convenience, or performance.

I favour easy APIs, and that is where the principle of least surprise
comes in. It means to me that I don't have to fetch the manuals every
time (like I do with Perl). So, let's not throw away the baby with the
bath water. I like POLS, as much as I like KISS.

> > >> report.models[0].name
> > => "M0"
>
> I suppose report.models returns a Hash containing objects of newly written
> class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
> It seems good.

In fact, I have made it an array. See my PAML branch.

> > runner.rb --test-dependencies
>
> I'm now searching ways to pass such parameters to tests.

In the runner you can parse the parameters first and pull them off the
stack. I did something like that for cfruby:

  http://cfruby.rubyforge.org/git?p=cfruby.git;a=blob;f=test/runner.rb;h=c202e48783a744c4cb3e339e2b891b3eab354c3e;hb=HEAD

> I'd like to keep tests simple and clear, and I think using standard
> File.read is enough and clearer. When using such special class, to know
> the behavior of the test code, reading extra file is needed.

I disagree, but that is obvious.

> > Personally, I dislike the naming/name space scheme of Bioruby.
> > What to think of invoking a class named
> >
> >   report = Bio::PAML::Codeml::Report.new
>
> Because there are many bioinformatics software and databases, names
> tends to be longer, and nesting of namespace tends to be deeper.
> I'd like to know naming rules and policies of other open-bio projects.

I think we should not model ourselves on these. We can do better. RoR
is a much better example to model ourselves on.

> > Why can't it just be
> >
> >   include Bio
> >   report = Codeml.new
>
> I think it is enough to write "include Bio::PAML" instead of (or in
> addition to) "include Bio".

Not really. It brings in another source of errors for users if they
have to think about that context every time. We will get all
variants, like Bio::Kegg, Bio::Sequence etc.

I think name spaces are there to *avoid* conflict. If a naming scheme
precludes conflict, why bring in another layer? I want Bioruby to be
as easy as possible, and with the least amount of typing. More text =
harder to read.

> > include Bio
> > result = Paml.new(:program => 'codeml')
>
> I don't like introducing such new parameter like :program.
> I think 1 class 1 binary is better.

I agree. It was just another option.

> In addition, because the differences within PAML tools (codeml, baseml,
> yn00, etc.) are currently not small, merging the classes is not so
> realistic now.

We have to separate our own conveniences from design choices.

Meanwhile I do agree we should not change the current interfaces. We
can create a new version of Bioruby with both old and new interfaces
supported. That is one thing I propose.

I am putting together a discussion document on the future of Bioruby
(design choices). We will have opportunity to discuss that in Japan.
We can consider raising a community vote once we have a list of
options.

Pj.

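The runner idea mentioned in this message (parse custom switches first and pull them off the argument list) could look roughly like the sketch below. It is an illustration only, not the cfruby runner: the helper name extract_switch! is hypothetical, and only the --test-dependencies switch name comes from the thread.

```ruby
# Hypothetical helper: remove a custom switch from an argument list and
# report whether it was present, so that the remaining arguments can be
# handed on to the test framework untouched.
def extract_switch!(args, switch)
  # Array#delete returns the removed item, or nil when it was absent
  !args.delete(switch).nil?
end

# In a real runner this would operate on ARGV; a literal is used here
# so that the sketch is self-contained.
args = ["--test-dependencies", "-v"]
with_dependencies = extract_switch!(args, "--test-dependencies")

puts "run tests with external dependencies: #{with_dependencies}"
puts "arguments left for the test framework: #{args.inspect}"
```
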
From pjotr.public14 at thebird.nl Mon Jan 4 06:51:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 12:51:05 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100104115105.GA21035@thebird.nl>

I have updated the writeup at

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

have a look at my PAML branch. The (old) unit tests pass.

  http://github.com/pjotrp/bioruby/tree/PAML

I have to add the positive selection sites, to complete it.

Pj.

From tomoakin at kenroku.kanazawa-u.ac.jp Mon Jan 4 07:33:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 4 Jan 2010 21:33:20 +0900
Subject: [BioRuby] Bioruby design
In-Reply-To: <20100104090318.GA16136@thebird.nl>
References: <20100104090318.GA16136@thebird.nl>
Message-ID: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>

Hi,

> As people tend not to think of Paml as a toolbox I would prefer
> to have one object named Paml. With behind it the codeml 'engine'
> and reporter. This would work for me (also note Paml does
> not return a report, but rather a result):

I don't agree on this point.
PHYLIP is clearly a package or collection of programs,
and so are Molphy, PAML, ...

>  result = Paml.new(:program => 'codeml')

And if you make a single object, it is not too obvious how to divide
based on the program, since aaml is now done by codeml but should be
considered a clearly different function.

>>>  include Bio
>>>  report = Codeml.new
>>>
>>
>> I think it is enough to write "include Bio::PAML" instead of (or in
>> addition to) "include Bio".
>>
>
> Not really. It brings in another source of errors for users if they
> have to think about that context every time. We will get all
> variants, like Bio::Kegg, Bio::Sequence etc.
These are short enough, since we have to write something like
"PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
in the manuscript of the paper if we use that module.
Stating their use explicitly in the first lines of the
program is considered good.

On the other hand, I don't like "include Bio::Sequence", since it is a
function of bioruby in itself.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From pjotr.public14 at thebird.nl Mon Jan 4 10:04:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 16:04:59 +0100
Subject: [BioRuby] Bioruby design
In-Reply-To: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
References: <20100104090318.GA16136@thebird.nl> <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100104150459.GB21412@thebird.nl>

On Mon, Jan 04, 2010 at 09:33:20PM +0900, Tomoaki NISHIYAMA wrote:
> These are short enough, since we have to write something like
> "PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
> in the manuscript of the paper if we use that module.
> Stating their use explicitly in the first lines of the
> program is considered good.

Uhm. I think that is a bit far fetched. The way you propose it is that
you would have to load the name space every time you use something in
code:

  require 'bio'
  include Bio::PAML
  include Bio::Kegg
  include ...

  do something

next source file, the same. And again:

  require 'bio'
  include Bio::PAML
  include Bio::Kegg
  include ...

  do something

This is the philosophy of Python - where every source file explicitly
loads all modules/name spaces. It is arguably 'clear'. But ugly. And,
takes the fun out of programming (anyone mention that?).

Only once I have used the Python name spacing with good effect. It was
when we plugged in a replacement module - completely rewritten. That
was changing one line only - and it worked :-).
In Python you can say

  import Paml as paml

it became

  import Paml2 as paml

That was nice. But when you see Python source files, the header is
ugly, and wastes a lot of typing. See for example:

  http://pypi.python.org/pypi/zope.sqlalchemy#example

I argue not to state imports.

  import Bio

should be part of

  require 'bio'

Anyway, we will have time to talk in Tokyo, I hope.

Pj.

P.S. Do you have an example of anyone quoting a Bioruby module in a
paper?

From pjotr.public14 at thebird.nl Mon Jan 4 12:09:04 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 18:09:04 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104115105.GA21035@thebird.nl>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl>
Message-ID: <20100104170904.GA26187@thebird.nl>

The writeup is pretty much done, as well as the implementation.

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

All unit tests pass:

  Running tests for PAML
  Loaded suite .
  Started
  ....................
  Finished in 0.398394 seconds.
  20 tests, 37 assertions, 0 failures, 0 errors

It is compatible with the old version. I have added 41 assertions
in the doctest (the header of report.rb).

  === Testing 'mydoc.test'...
  1.   OK  | Default Test
  41 comparisons, 1 doctests, 0 failures, 0 errors

You can view the tests and implementation at

  http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb

See also

The branch is:

  http://github.com/pjotrp/bioruby/tree/PAML

(don't you love github).

Pj.

From mail at michaelbarton.me.uk Mon Jan 4 12:50:50 2010
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Mon, 4 Jan 2010 12:50:50 -0500
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
Message-ID:

Hi Pjotr,

The expanded report.rb looks like an excellent and substantial
improvement to the previous version. You could add a deprecated tag to
the old interface methods and these could then be removed in a later
bioruby version to decrease clutter in the API.

Mike

From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 02:42:49 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 16:42:49 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
Message-ID: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

I'm reading the code (commit c2de9dd3ad055bab4bfb1d3e8da840493b110b0e).
It is generally good. Below are my comments and suggested changes.

> # == Examples
> #
> # Read the codeml M0-M3 data file into a buffer
> #
> #   >> require 'bio/test/biotestfile'
> #   >> buf = BioTestFile.read('paml/codeml/models/results0-3.txt')

It is not suitable to use such a nonstandard class in the example.
Users want to see example usage and do not intend to run tests.
Note that I still disagree with the BioTestFile class.

> class Report < Bio::PAML::Common::Report
>
>   attr_reader :models, :header, :footer

RDoc documentation is also needed for attributes.
To write RDoc, the three attribute definitions need to be
separated. For example,

  # Models in the result
  # (Array containing Bio::PAML::Codeml::Model objects)
  attr_reader :models

  # ...(should be written)
  attr_reader :header

  # ...(should be written)
  attr_reader :footer

> # Parse codeml output file passed with +buf+
> def initialize buf

Details of +buf+ (class, contents, etc) should also be written in RDoc.
It is recommended to use the style written in the README_DEV.rdoc, or
the style used in the Ruby source code.

Please do not omit parentheses in the method definition lines.

> # Model class
> class Model

Too little documentation.
At least please write a message that it is created by
Bio::PAML::Codeml::Report.

> def initialize buf

Please write in the RDoc that normal users do not use the method
directly, and that it is called internally inside the
Bio::PAML::Codeml::Report objects.

Please do not omit parentheses in the method definition lines.

> def lnL

RDoc documentation is needed here, and also for the omega, kappa,
alpha, tree_length, tree, and to_s methods.

> class PositiveSite

Almost all methods have no RDoc documentation.

> def to_a
>   [ @position, @aaref, @probability, @omega ]
> end

What is the purpose of the method?

> class PositiveSites < Array

To inherit Array and to create original container class is discouraged.
In BioRuby, we have deprecated Bio::Features and Bio::References in
version 1.3.0, although they do not inherit Array but have an array
in the object. (The classes still exist only for backward compatibility,
in lib/bio/compat/features.rb and references.rb).

In this case, except initialize, only a method named "graph" is added.
I think it is good to add the graph method in the Report class and
using an Array for storing PositiveSite objects.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl Tue Jan 5 05:32:12 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 11:32:12 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105103212.GA4584@thebird.nl>

Hi Naohisa,

First I thought you were kidding. But then I realise you are serious.

I don't think we need to document every simple class variable/accessor
to accept this source code. That is overkill. If you don't understand
lnL or alpha, don't use it.
We are not in the business of documenting
for documenting's sake. Documenting lnL and alpha will be like:

  "Retrieve the lnL value from the Report"

  "Retrieve the alpha value from the Report"

etc. etc. I don't think we should be doing that. Standard 1-to-1
relations are obvious and don't need lots of text in the code base.

If someone feels like filling in these obvious statements, fine. It
really goes against my grain. Do we document every single accessor?
Note the previous implementation did no such thing. That code was
accepted fine (and partially written by you).

> Details of +buf+ (class, contents, etc) should also be written in RDoc.
> It is recommended to use the style written in the README_DEV.rdoc, or
> the style used in the Ruby source code.

You mean the contents of the input buffer, which is the content of the
input file? I see many places in Bioruby where no such a thing is
done. Why become strict on this now? If you want a different
descriptive name for the variable - that is fine. Propose me
a better name.

> > def to_a
> >   [ @position, @aaref, @probability, @omega ]
> > end
> What is the purpose of the method?

Access converter. Convenience, really. You can remove it if you
dislike it so much. I use it for testing and to write to a file. Could
be to_s too, but that fixates the format.

> > class PositiveSites < Array
>
> To inherit Array and to create original container class is discouraged.
> In BioRuby, we have deprecated Bio::Features and Bio::References in
> version 1.3.0, although they do not inherit Array but have an array
> in the object. (The classes still exist only for backward compatibility,
> in lib/bio/compat/features.rb and references.rb).

PositiveSites object has all the features of a list (ie Array). I
think inheritance is what it should be. It is an is_a relationship.
Adding a @list will just add code. Not only for initialization, but
also for iterators. I only see how we can move backwards from readable
code.
Nor is it good OOP practice. Inheritance is not *always* bad,
though I agree it is used too quickly (in general).

> In this case, except initialize, only a method named "graph" is added.
> I think it is good to add the graph method in the Report class and
> using an Array for storing PositiveSite objects.

This is awful. The graph is a feature of PositiveSites, and not of the
report *parser*. To keep things simple it is best practise to have
functionality where it belongs. It is good OOP design. Your proposal
means the Report class becomes less obvious in what it is. Look how
clean it is now!

What do other people think on this list. I am at a disadvantage here.

I would like this code accepted in Bioruby, so other people can use
it. I disagree with most of above 'criticism'. I certainly balk at the
last non-OOP ones. This is not the first time I am really unhappy. I
can't believe how much trouble I have to go to for a simple class,
which, as it happens, has a perfectly acceptable implementation by
most measures.

Pj.

From jan.aerts at gmail.com Tue Jan 5 06:53:53 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Tue, 5 Jan 2010 11:53:53 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl>
Message-ID: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>

All,

It appears that the pre-hackathon bioruby meeting will be very useful :-)
Why don't we use that time to focus on the bit-more-distant future of
bioruby: bioruby 2.0? We could discuss what it should look like without
having to worry about backward compatibility.
Topics:
* documentation style (I happen to agree with Naohisa on that)
* class hierarchy: how would we organize the information if we had to
  start from scratch? (maybe we should follow bioperl's lead with a Root
  class?)
* coding style
* general interface decisions
* ...

jan.

PS: Still don't know if I can make it to Japan. Will know this afternoon
(broken foot might interfere...)

From pjotr.public14 at thebird.nl Tue Jan 5 07:39:02 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 13:39:02 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100105123902.GA10823@thebird.nl>

I propose to write an ALN output writer. ALN files show aligned
sequences with additional lines of information (like a match line). I
want to use it to output PAML positive selection sites. This is
the idea:

  SEQ1 alignment 1...
  SEQ2 alignment 2...
  ...*.:*....***  (match line)
  ...*....*.....  (pos. sel. line)

Do we want such ALN output (I think it is allowed), and can we allow
for the additional output? I have a proposed interface here:

  http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb

I notice ClustalW.to_fasta has been obsoleted. But we don't have
to_aln yet, and we need to allow adding match_lines and other
information.

Pj.

From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 08:20:24 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 22:20:24 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl>
Message-ID: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

On Tue, 5 Jan 2010 11:32:12 +0100
Pjotr Prins wrote:

> Hi Naohisa,
>
> First I thought you were kidding. But then I realise you are serious.
>
> I don't think we need to document every simple class variable/accessor
> to accept this source code. That is overkill.
> If you don't understand
> lnL or alpha, don't use it. We are not in the business of documenting
> for documenting's sake. Documenting lnL and alpha will be like:
>
>   "Retrieve the lnL value from the Report"
>
>   "Retrieve the alpha value from the Report"
>
> etc. etc. I don't think we should be doing that. Standard 1-to-1
> relations are obvious and don't need lots of text in the code base.

Even just one word is OK, e.g. "lnL", "alpha".
But having no RDoc at all is not allowed.

Ideally, it would be really great if an informative description
could help people unfamiliar with Codeml, and this might encourage
people to begin using Codeml with BioRuby. I understand this
cannot be easily achieved.

When writing a new class or adding a large amount of code, it is also
fine to implement first with minimal documentation and to improve
the documentation gradually later.

> If someone feels like filling in these obvious statements, fine. It
> really goes against my grain. Do we document every single accessor?
> Note the previous implementation did no such thing. That code was
> accepted fine (and partially written by you).

In late 2005, we decided that all methods, attributes, classes,
modules, etc. should be documented by using RDoc. Code written
before early 2006 may have no RDoc. I'm working to add RDoc to
such code gradually, but have not finished yet.

> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
> > It is recommended to use the style written in the README_DEV.rdoc, or
> > the style used in the Ruby source code.
>
> You mean the contents of the input buffer, which is the content of the
> input file? I see many places in Bioruby where no such a thing is
> done. Why become strict on this now? If you want a different
> descriptive name for the variable - that is fine. Propose me
> a better name.

No need to change the variable name. I mean I want to make clear that
it refers to the contents of the file and not a filename. If you think
the current description is clear enough, it is OK.

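To make the documentation requests in this thread concrete (one RDoc comment per attribute, a description of +buf+, parentheses in the method definition), here is a small sketch. It is an assumption-laden stand-in, not the code under review: the class name SketchReport, the "---" section separator, and the toy parsing are all hypothetical, and the comment layout only approximates the argument-list style the thread attributes to README_DEV.rdoc.

```ruby
# Hypothetical, simplified stand-in for a report class. Only the RDoc
# layout matters here; the parsing is a placeholder.
class SketchReport
  # Model sections found in the result (Array of String)
  attr_reader :models

  # Text before the first model section (String)
  attr_reader :header

  # Text after the last model section (String)
  attr_reader :footer

  # Creates a new report object.
  # ---
  # *Arguments*:
  # * (required) _buf_: String holding the contents of an output file
  #   (the text itself, not a file name)
  def initialize(buf)
    # Toy format assumed for this sketch: sections separated by a line
    # that contains only "---".
    parts = buf.split(/^---\n/)
    @header = parts.first.to_s
    @footer = parts.size > 1 ? parts.last : ""
    @models = parts[1...-1] || []
  end
end

report = SketchReport.new("header\n---\nM0\n---\nM3\n---\nfooter\n")
puts report.models.size
```

Even one-line comments like these are enough for RDoc to list each attribute separately, which is the point raised about splitting the combined attr_reader line.
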
> > > def to_a > > > [ @position, @aaref, @probability, @omega ] > > > end > > What is the purpose of the method? > > Access converter. Convenience, really. You can remove it if you > dislike it so much. I use it for testing and to write to a file. Could > be to_s too, but that fixates the format. OK if you feel useful. > > > class PositiveSites < Array > > > > To inherit Array and to create original container class is discouraged. > > In BioRuby, we have deprecated Bio::Features and Bio::References in > > version 1.3.0, although they do not inherit Array but have an array > > in the object. (The classes still exist only for backward compatibility, > > in lib/bio/compat/features.rb and references.rb). > > PositiveSites object has the all the features of a list (ie Array). I > think inheritance is what it should be. It is an is_a relationship. > Adding a @list will just add code. Not only for initialization, but > also for iterators. I only see how we can move backwards from readable > code. Nor is it good OOP practice. Inheritance is not *always* bad, > though I agree it is used too quickly (in general). > > > In this case, except initialize, only a method named "graph" is added. > > I think it is good to add the graph method in the Report class and > > using an Array for storing PositiveSite objects. > > This is awful. The graph is a feature of PositiveSites, and not of the > report *parser*. To keep things simple it is best practise to have > functionality where it belongs. It is good OOP design. Your proposal > means the Report class becomes less obvious in what it is. Look how > clean it is now! I respect your design if the class is not only a container of PositiveSite objects but also having methods doing special things by using relations among two or more objects which is not a simple accumulation of each object's information. > What do other people think on this list. I am at a disadvantage here. 
> > I would like this code accepted in Bioruby, so other people can use > it. I disagree with most of above 'criticism'. I certainly balk at the > last non-OOP ones. This is not the first time I am really unhappy. I > can't believe how much trouble I have to go to for a simple class, > which, as it happens, has a perfectly acceptable implementation by > most measures. > > Pj. > Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 08:28:28 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 5 Jan 2010 22:28:28 +0900 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105123902.GA10823@thebird.nl> References: <20100105123902.GA10823@thebird.nl> Message-ID: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, There is already Bio::Alignment#output_clustal method. It is implemented in Bio::Alignment::Output module. http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 5 Jan 2010 13:39:02 +0100 Pjotr Prins wrote: > I propose to write an ALN output writer. ALN files show aligned > sequences with additional lines of information (like a match line). I > want to use it to output PAML positive selection sites. This is > the idea: > > > SEQ1 alignment 1... > SEQ2 alignment 2... > ...*.:*....*** (match line) > ...*....*..... (pos. sel. line) > > Do we want such ALN output (I think it is allowed), and can we allow > for the additional output. I have a proposed interface here: > > http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb > > I notice ClustalW.to_fasta has been obsoleted. But we don't have > to_aln yet, and we need to allow adding match_lines and other > information. > > Pj. 
> From pjotr.public14 at thebird.nl Tue Jan 5 12:04:34 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 18:04:34 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100105170434.GB13498@thebird.nl> Hi Naohisa, Thanks for clarifying. I am happy now. Pj. From pjotr.public14 at thebird.nl Tue Jan 5 12:09:25 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 18:09:25 +0100 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100105170925.GA13828@thebird.nl> On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote: > Hi Pjotr, > > There is already Bio::Alignment#output_clustal method. > It is implemented in Bio::Alignment::Output module. > > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 I missed that. Still it has no functionality for adding the match_line, nor for adding extra information lines. Can I modify this to give this method an optional parameter (list of String) for this? The Alignment class is not aware of 'imported' match lines (it is Clustal specific in Bioruby at this stage). How do you suppose we can do this so I can generate the ALN with multiple match lines? Pj. 
From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 22:31:25 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 6 Jan 2010 12:31:25 +0900 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105170925.GA13828@thebird.nl> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105170925.GA13828@thebird.nl> Message-ID: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 5 Jan 2010 18:09:25 +0100 Pjotr Prins wrote: > On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > There is already Bio::Alignment#output_clustal method. > > It is implemented in Bio::Alignment::Output module. > > > > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 > > I missed that. Still it has no functionality for adding the > match_line, nor for adding extra information lines. Can I modify this > to give this method an optional parameter (list of String) for this? > > The Alignment class is not aware of 'imported' match lines (it is Clustal > specific in Bioruby at this stage). The output_clustal method gets an argument named "options" as a Hash. The match line can be altered by any given string with an option. alignment.output_clustal(:match_line => str) I'm very sorry for incomplete documentation. It was first written in 2003, and documents were added after 2005 but still incomplete. Bio::Alignment#match_line method is the match line calculation method with the same algorithm as ClustalW. > How do you suppose we can do this so I can generate the ALN with > multiple match lines? I'm afraid this is not regarded as Clustal format. Of course, it is technically easy to add such function. There may be many private extensions of Clustal format. I think this is OK because Clustal format is rough, although this makes hard to validate Clustal format. 
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Wed Jan 6 03:07:10 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 6 Jan 2010 09:07:10 +0100 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105170925.GA13828@thebird.nl> <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100106080710.GA23141@thebird.nl> On Wed, Jan 06, 2010 at 12:31:25PM +0900, Naohisa GOTO wrote: > > How do you suppose we can do this so I can generate the ALN with > > multiple match lines? > > I'm afraid this is not regarded as Clustal format. > Of course, it is technically easy to add such function. > > There may be many private extensions of Clustal format. > I think this is OK because Clustal format is rough, > although this makes hard to validate Clustal format. Standards are vague. EMBOSS does not even mention the match line, but as ClustalW generates it we assume it is a 'standard'. I think most parsers basically ignore lines starting with white space. So multiple 'match lines' should normally work. Many standards in bioinformatics evolve from use - maybe my idea will become a standard one day ;-). I think it is a nice feature to have. I'll add a warning that one should use it with caution. BTW the ALN-writer should really live in its own class/module, similar to the current layout for the 'Report' class (which in reality is an ALN parser, or ALN-reader). It is no surprise I did not find either of them when I was looking for an implementation. OK, I'll cook something up in a separate git branch. Pj. 
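The idea discussed in this thread (extra annotation lines that start with white space, which a tolerant reader simply skips) can be sketched in plain Ruby as follows. The helper names aln_block and read_aln_block, the 16-column name width, and the toy sequences and annotation strings are all assumptions for illustration; this is not the Bio::Alignment#output_clustal implementation and it does not write the CLUSTAL header line.

```ruby
# Hypothetical helpers; not BioRuby API.

# Formats one Clustal-like block: a padded name column followed by the
# sequence, then any number of annotation lines that start with white
# space (match line, positive selection line, ...).
def aln_block(seqs, annotation_lines, name_width = 16)
  lines = seqs.map { |name, seq| name.ljust(name_width) + seq }
  annotation_lines.each { |ann| lines << (' ' * name_width + ann) }
  lines.join("\n") + "\n"
end

# Reads the sequences back, ignoring blank lines and every line that
# starts with white space, so extra annotation lines do no harm.
def read_aln_block(text)
  seqs = {}
  text.each_line do |line|
    next if line =~ /\A\s/
    name, seq = line.split
    next if seq.nil?
    seqs[name] = seqs.fetch(name, '') + seq
  end
  seqs
end

seqs = [['SEQ1', 'MKVLA'], ['SEQ2', 'MKILA']]
block = aln_block(seqs, ['**:**', '..*..'])
puts block
p read_aln_block(block)
```

A reader written this way returns the same sequences whether the block carries zero, one, or several annotation lines, which is the property the proposal relies on. Stricter parsers may behave differently, hence the caution mentioned above.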
From mail at michaelbarton.me.uk Wed Jan 6 11:58:01 2010 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 6 Jan 2010 11:58:01 -0500 Subject: [BioRuby] Codeml parser In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> Message-ID: 2010/1/5 Jan Aerts : > It appears that the pre-hackathon bioruby meeting will be very useful :-) > Why don't we use that time to focus on the bit-more-distant future of > bioruby: bioruby 2.0? We could discuss what it should look like without > having to worry about backward compatibility. I second what Jan has suggested about the direction of BioRuby and version 2.0. As Ruby becomes more popular a programming language in bioinformatics it might be expected that BioRuby will receive more and more contributions. Prior to BioRuby 2.0 might be a nice time to discuss how BioRuby will grow and be organised as it increases in size. Topics: > * documentation style (I happen to agree with Naohisa on that) > * class hierarchy: how would we organize the information if we had to start > from scratch? (maybe we should follow bioperl's lead with a Root class?) > * coding style > * general interface decisions > * ... > > jan. > > PS: Still don't know if I can make it to Japan. Will know this afternoon > (broken foot might interfere...) > > 2010/1/5 Pjotr Prins > >> Hi Naohisa, >> >> First I thought you were kidding. But then I realise you are serious. >> >> I don't think we need to document every simple class variable/accessor >> to accept this source code. That is overkill. If you don't understand >> lnL or alpha, don't use it. 
We are not in the business of documenting >> for documenting's sake. Documenting lnL and alpha will be like: >> >> "Retrieve the lnL value from the Report" >> >> "Retrieve the alpha value from the Report" >> >> etc. etc. I don't think we should be doing that. Standard 1-to-1 >> relations are obvious and don't need lots of text in the code base. >> >> If someone feels like filling in these obvious statements, fine. It >> really goes against my grain. Do we document every single accessor? >> Note the previous implementation did no such thing. That code was >> accepted fine (and partially written by you). >> >> > Details of +buf+ (class, contents, etc) should also be written in RDoc. >> > It is recommended to use the style written in the README_DEV.rdoc, or >> > the style used in the Ruby source code. >> >> You mean the contents of the input buffer, which is the content of the >> input file? I see many places in Bioruby where no such a thing is >> done. Why become strict on this now? If you want a different >> descriptive name for the variable - that is fine. Propose me >> a better name. >> >> > > def to_a >> > > [ @position, @aaref, @probability, @omega ] >> > > end >> > What is the purpose of the method? >> >> Access converter. Convenience, really. You can remove it if you >> dislike it so much. I use it for testing and to write to a file. Could >> be to_s too, but that fixates the format. >> >> > > class PositiveSites < Array >> > >> > To inherit Array and to create original container class is discouraged. >> > In BioRuby, we have deprecated Bio::Features and Bio::References in >> > version 1.3.0, although they do not inherit Array but have an array >> > in the object. (The classes still exist only for backward compatibility, >> > in lib/bio/compat/features.rb and references.rb). >> >> PositiveSites object has the all the features of a list (ie Array). I >> think inheritance is what it should be. It is an is_a relationship.
>> Adding a @list will just add code. Not only for initialization, but >> also for iterators. I only see how we can move backwards from readable >> code. Nor is it good OOP practice. Inheritance is not *always* bad, >> though I agree it is used too quickly (in general). >> >> > In this case, except initialize, only a method named "graph" is added. >> > I think it is good to add the graph method in the Report class and >> > using an Array for storing PositiveSite objects. >> >> This is awful. The graph is a feature of PositiveSites, and not of the >> report *parser*. To keep things simple it is best practise to have >> functionality where it belongs. It is good OOP design. Your proposal >> means the Report class becomes less obvious in what it is. Look how >> clean it is now! >> >> What do other people think on this list. I am at a disadvantage here. >> >> I would like this code accepted in Bioruby, so other people can use >> it. I disagree with most of above 'criticism'. I certainly balk at the >> last non-OOP ones. This is not the first time I am really unhappy. I >> can't believe how much trouble I have to go to for a simple class, >> which, as it happens, has a perfectly acceptable implementation by >> most measures. >> >> Pj. 
From jan.aerts at gmail.com Fri Jan 8 11:29:07 2010 From: jan.aerts at gmail.com (Jan Aerts) Date: Fri, 8 Jan 2010 16:29:07 +0000 Subject: [BioRuby] Codeml parser In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> Message-ID: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com> Maybe it'd be a good idea to start thinking at a level removed from actual code, and create some general design documents first. Maybe we should * describe what we actually want to achieve with the bioruby toolkit: should it be a library foremost, or should it rather be an interface to run other programs (e.g. BLAST)? * make a high-level overview of different parts of bioruby: - how do we handle file formats: are the files actual objects, or do they merely describe a biological entity? E.g. does a FASTA file merit the instantiation of a FASTA object, or is it nothing more than a container of Sequence objects? - how do different parts of the library interact? Should we have a Root class such as in bioperl? What type of class should be used to interface with the world (e.g. file parsing)? What type of class should be used to actually contain the object data (e.g. annotated sequence)?
When that's done: come up with general guidelines for coding, e.g. always use keyword-based argument lists or something (just an example). jan. 2010/1/5 Jan Aerts > All, > > It appears that the pre-hackathon bioruby meeting will be very useful :-) > Why don't we use that time to focus on the bit-more-distant future of > bioruby: bioruby 2.0? We could discuss what it should look like without > having to worry about backward compatibility. Topics: > * documentation style (I happen to agree with Naohisa on that) > * class hierarchy: how would we organize the information if we had to start > from scratch? (maybe we should follow bioperl's lead with a Root class?) > * coding style > * general interface decisions > * ... > > jan. > > PS: Still don't know if I can make it to Japan. Will know this afternoon > (broken foot might interfere...) > > 2010/1/5 Pjotr Prins > > Hi Naohisa, >> >> First I thought you were kidding. But then I realise you are serious. >> >> I don't think we need to document every simple class variable/accessor >> to accept this source code. That is overkill. If you don't understand >> lnL or alpha, don't use it. We are not in the business of documenting >> for documenting's sake. Documenting lnL and alpha will be like: >> >> "Retrieve the lnL value from the Report" >> >> "Retrieve the alpha value from the Report" >> >> etc. etc. I don't think we should be doing that. Standard 1-to-1 >> relations are obvious and don't need lots of text in the code base. >> >> If someone feels like filling in these obvious statements, fine. It >> really goes against my grain. Do we document every single accessor? >> Note the previous implementation did no such thing. That code was >> accepted fine (and partially written by you). >> >> > Details of +buf+ (class, contents, etc) should also be written in RDoc. >> > It is recommended to use the style written in the README_DEV.rdoc, or >> > the style used in the Ruby source code.
>> >> You mean the contents of the input buffer, which is the content of the >> input file? I see many places in Bioruby where no such a thing is >> done. Why become strict on this now? If you want a different >> descriptive name for the variable - that is fine. Propose me >> a better name. >> >> > > def to_a >> > > [ @position, @aaref, @probability, @omega ] >> > > end >> > What is the purpose of the method? >> >> Access converter. Convenience, really. You can remove it if you >> dislike it so much. I use it for testing and to write to a file. Could >> be to_s too, but that fixates the format. >> >> > > class PositiveSites < Array >> > >> > To inherit Array and to create original container class is discouraged. >> > In BioRuby, we have deprecated Bio::Features and Bio::References in >> > version 1.3.0, although they do not inherit Array but have an array >> > in the object. (The classes still exist only for backward compatibility, >> > in lib/bio/compat/features.rb and references.rb). >> >> PositiveSites object has the all the features of a list (ie Array). I >> think inheritance is what it should be. It is an is_a relationship. >> Adding a @list will just add code. Not only for initialization, but >> also for iterators. I only see how we can move backwards from readable >> code. Nor is it good OOP practice. Inheritance is not *always* bad, >> though I agree it is used too quickly (in general). >> >> > In this case, except initialize, only a method named "graph" is added. >> > I think it is good to add the graph method in the Report class and >> > using an Array for storing PositiveSite objects. >> >> This is awful. The graph is a feature of PositiveSites, and not of the >> report *parser*. To keep things simple it is best practise to have >> functionality where it belongs. It is good OOP design. Your proposal >> means the Report class becomes less obvious in what it is. Look how >> clean it is now! >> >> What do other people think on this list. 
I am at a disadvantage here. >> >> I would like this code accepted in Bioruby, so other people can use >> it. I disagree with most of above 'criticism'. I certainly balk at the >> last non-OOP ones. This is not the first time I am really unhappy. I >> can't believe how much trouble I have to go to for a simple class, >> which, as it happens, has a perfectly acceptable implementation by >> most measures. >> >> Pj. From pjotr.public14 at thebird.nl Fri Jan 8 12:21:32 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 8 Jan 2010 18:21:32 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com> Message-ID: <20100108172132.GA28895@thebird.nl> On Fri, Jan 08, 2010 at 04:29:07PM +0000, Jan Aerts wrote: > Maybe it'd be a good idea to start thinking at a level removed from actual > code, and create some general design documents first. Maybe we should > * describe what we actually want to achieve with the bioruby toolkit: should > it be a library foremost, or should it rather be an interface to run other > programs (e.g. BLAST)? I think calling into other programs is a good feature, but should be really split out. Likewise for web services. Both split in terms of objects and directory layout. Currently there is too intertwined functionality.
Then there is support for reading and writing standard formats. Then there is extra functionality (not found elsewhere, perhaps). And we have Rails support and the shell. All these should be clearly split out. I don't think we have to choose. We can have it all. Just make sure it sits in the right location. > * make a high-level overview of different parts of bioruby: > - how do we handle file formats: are the files actual objects, or do they > merely describe a biological entity? E.g. does a FASTA file merit the > instantiation of a FASTA object, or is it nothing more than a container of > Sequence objects? > - how do different parts of the library interact? Should we have a Root > class such as in bioperl? What type of class should be used to interface > with the world (e.g. file parsing)? What type of class should be used to > actually contain the object data (e.g. annotated sequence)? > > When that's done: come up with general guidelines for coding, e.g. always > use keyword-based argument lists or something (just an example). These choices are design choices and have to originate in a list of shared 'values'. Because if we don't agree on a value there will always be arguments and disagreement. One value would be 'clear documentation', but this may collide with 'clear source code'. Similarly 'Easy to use code' and 'Concise code' may collide. Or functional choices over OOP. We need to put those values together and rank them in importance. Once the ranking is set we can make easy choices in guidelines. I am writing a type of Manifest. I'll present that in the coming weeks, when I feel I am ready. It is meant for discussion in Japan, and after. Pj. 
From pjotr.public14 at thebird.nl Mon Jan 11 09:40:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jan 2010 15:40:41 +0100 Subject: [BioRuby] Clustal ALN writer Message-ID: <20100111144041.GA31684@thebird.nl> I have created a colorized HTML alignment file with consensus information and amino acids showing evidence of positive selection (based on PAML output). http://thebird.nl/projects/test_color2.html I did a write up on the implementation at: http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR Enjoy, Pj. From ngoto at gen-info.osaka-u.ac.jp Tue Jan 12 04:29:57 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 12 Jan 2010 18:29:57 +0900 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100111144041.GA31684@thebird.nl> References: <20100111144041.GA31684@thebird.nl> Message-ID: <20100112092957.A16001CBC49E@idnmail.gen-info.osaka-u.ac.jp> Hi, I'm not sure whether the prefix Bio::Html is suitable or not. By the way, I've tried some of your code in http://github.com/pjotrp/bioruby/blob/color-alignment/ and found a potential XSS.

  a = Bio::Alignment.new
  a.add_seq('ATCCATGG', '')
  a.add_seq('ATGCATGC', '')
  a.add_seq('', 'c')
  simple = Bio::Html::HtmlAlignment.new(a, :title => '')
  html = simple.html()
  File.open('/tmp/xss.html', 'w') { |w| w.print html }

For sequences, sequence names, and consensus lines, using CGI.escapeHTML() will always be needed. For the :title, if script users can set the title, it should be escaped, but this prevents script programmers from using HTML tags in the title. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Mon, 11 Jan 2010 15:40:41 +0100 Pjotr Prins wrote: > I have created a colorized HTML alignment file with consensus > information and amino acids showing evidence of positive selection > (based on PAML output). > > http://thebird.nl/projects/test_color2.html > > I did a write up on the implementation at: > > http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR > > Enjoy, > > Pj.
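A minimal sketch of escaping sequence names and sequences with CGI.escapeHTML at the point where the HTML is built. The helper name html_alignment_row, the name width, and the sample rows are assumptions for illustration; this is not the Bio::Html::HtmlAlignment code from the branch.

```ruby
require 'cgi'

# Hypothetical helper; not the Bio::Html::HtmlAlignment implementation.
# The padding is computed from the raw name length and the escaping is
# applied afterwards, so the columns still line up once a browser
# renders the entities as single characters.
def html_alignment_row(name, seq, name_width = 16)
  padding = ' ' * [name_width - name.length, 1].max
  CGI.escapeHTML(name) + padding + CGI.escapeHTML(seq)
end

# Made-up rows whose names contain HTML special characters.
rows = [['seq<1>', 'ATCCATGG'], ['a&b', 'ATGCATGC']]
html = "<pre>\n" +
       rows.map { |name, seq| html_alignment_row(name, seq) }.join("\n") +
       "\n</pre>"
puts html
```

Because the caller passes plain strings and the formatter does the escaping, "<", ">", and "&" in names are displayed literally instead of being interpreted as markup.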
> > > > From pjotr.public14 at thebird.nl Tue Jan 12 05:11:32 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 12 Jan 2010 11:11:32 +0100 Subject: [BioRuby] Bioruby HTML output Message-ID: <20100112101132.GC10308@thebird.nl> On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote: > I'm not sure whether the prefix Bio::Html is suitable or not. Me neither ;). This is something to discuss when we meet. See my write up on partitioning based on functionality or standards. > By the way, I'v tried some of your code in > http://github.com/pjotrp/bioruby/blob/color-alignment/ > and found potential XSS. > > a = Bio::Alignment.new > a.add_seq('ATCCATGG', '') > a.add_seq('ATGCATGC', '') > a.add_seq('', 'c') > simple = Bio::Html::HtmlAlignment.new(a, > :title => '') > html = simple.html() > File.open('/tmp/xss.html', 'w') { |w| w.print html } > > For sequences, sequence names, and consensus lines, > using CGI.escapeHTML() will always be needed. > > For the :title, if script users can set the title, it > should be escaped, but this prevents script programmers > using html tags in the title. Perhaps the HTML generator should escape its output. Though I personally think we should only be worried about security concerns when people *enter* new data on input forms. That is when exploits show up. I can argue that HTML generation should not concern itself with HOW the inputs are presented. One advantage of having a programmer set the 'title' is that he *can* embed HTML. Perhaps escaping HTML is the responsibility of the programmer providing the data. And therefore to the logic that handles input. We have had a similar discussion before. We have to decide to what level *output* code should concern itself with *input* security. I have a feeling that too much of Bioruby classes try to do too much. How do we stay away from cluttering the code? How do we decide that callers should not use HTML and handle security concerns? 
You write: > a.add_seq('ATCCATGG', '') If a programmer wants that - it is his concern in my opinion. If he is concerned about exploits he should not allow it. The Alignment class does not care either. It is none of its business. BTW I fixed a number of PAML::Codeml bugs on this branch. So you can ignore the existing PAML branch. Let's continue with the color coding, assuming you can live with the PAML::Codeml implementation, as it stands. Pj. From donttrustben at gmail.com Tue Jan 12 07:52:42 2010 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 12 Jan 2010 22:52:42 +1000 Subject: [BioRuby] SPTR problem Message-ID: Hi, While parsing all the yeast UniProt txt files I came across a problem with the gn parser - it was returning an array when I expected a hash. Looking at the code, the problem seems to be this when statement:

  when /Name=/,/ORFNames=/
    @data['GN'] = gn_uniprot_parser
  else
    @data['GN'] = gn_old_parser
  end

http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:

  GN   OrderedLocusNames=YMR084W;

So the GN line had OrderedLocusNames= but neither Name= nor ORFNames=, so it didn't use the new parser as the other entries I came across did. Should all 4 possibilities be tested for in the when statement (Synonyms= being the 4th)? Also, while I'm here:
* why does the returned hash have different keys than are in the file? e.g. ORFNames becomes :orfs?
* I also found the parsing process for whole genomes quite slow (multiple hours for well annotated ones).
* is there any standard way to handle concatenated UniProt files? I wrote my own as it was simple.
Thanks, ben -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.
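The check asked about above, testing for all four GN keys rather than two, could be sketched like this. The method name gn_format, the constant name, and the sample strings other than the OrderedLocusNames line are made up for illustration; the real SPTR parser dispatches inside its own gn method and may be organized differently.

```ruby
# Hypothetical helper; not the BioRuby SPTR implementation.
# One pattern covering the four keys of the key=value GN style.
GN_NEW_FORMAT_KEYS = /(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

# Returns :uniprot when the GN text uses the key=value style,
# :old otherwise.
def gn_format(gn_text)
  case gn_text
  when GN_NEW_FORMAT_KEYS then :uniprot
  else :old
  end
end

puts gn_format('OrderedLocusNames=YMR084W;')     # line from the report above
puts gn_format('Name=abc1; ORFNames=xyz_0001;')  # made-up key=value line
puts gn_format('ABC1 OR XYZ1')                   # made-up old-style line
```

Note that /Name=/ on its own matches neither "ORFNames=" nor "OrderedLocusNames=", because an "s" sits between "Name" and "=", so each key has to be listed explicitly.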
From ngoto at gen-info.osaka-u.ac.jp Tue Jan 12 21:58:00 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 13 Jan 2010 11:58:00 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100112101132.GC10308@thebird.nl> References: <20100112101132.GC10308@thebird.nl> Message-ID: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 12 Jan 2010 11:11:32 +0100 Pjotr Prins wrote: > On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote: > > I'm not sure whether the prefix Bio::Html is suitable or not. > > Me neither ;). This is something to discuss when we meet. See my > write up on partitioning based on functionality or standards. > > > By the way, I'v tried some of your code in > > http://github.com/pjotrp/bioruby/blob/color-alignment/ > > and found potential XSS. > > > > a = Bio::Alignment.new > > a.add_seq('ATCCATGG', '') > > a.add_seq('ATGCATGC', '') > > a.add_seq('', 'c') > > simple = Bio::Html::HtmlAlignment.new(a, > > :title => '') > > html = simple.html() > > File.open('/tmp/xss.html', 'w') { |w| w.print html } > > > > For sequences, sequence names, and consensus lines, > > using CGI.escapeHTML() will always be needed. > > > > For the :title, if script users can set the title, it > > should be escaped, but this prevents script programmers > > using html tags in the title. > > Perhaps the HTML generator should escape its output. Though I > personally think we should only be worried about security concerns > when people *enter* new data on input forms. That is when exploits > show up. I can argue that HTML generation should not concern itself > with HOW the inputs are presented. One advantage of having a > programmer set the 'title' is that he *can* embed HTML. Perhaps > escaping HTML is the responsibility of the programmer providing the > data. And therefore to the logic that handles input. Even apart from security, sequence names (and sequences) that contain html special characters may not be correctly displayed. 
For example, sequences with three parameters a, b, and c.

  % cat test.aln
  CLUSTAL 2.0.9 multiple sequence alignment

  1<a<3_b>5_c<7   FKNVFTVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKWFLQ
  a>3_511         FKNVMSVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKTFLQ
                  ****::*********************************** ***

  % irb -r bio
  irb> report = Bio::ClustalW::Report.new(File.read('test.aln'))
  irb> alignment = report.alignment
  irb> simple = Bio::Html::HtmlAlignment.new(alignment, :title => 'a,b,c')
  irb> File.open('abc.html', 'w') { |w| w.print simple.html() }

The sequence names were treated correctly by ClustalW 2.0.9, but their representation in the HTML is unexpected. This problem can not be solved with input data escaping. If the sequence name "1<a<3_b>5_c<7" is escaped to "1&lt;a&lt;3_b&gt;5_c&lt;7" before calling the method, text indentation will be broken because of the mismatch between text length and HTML display width. To solve this, escaping needs to be done by the output formatting method when it builds the HTML. > We have had a similar discussion before. We have to decide to what > level *output* code should concern itself with *input* security. I > have a feeling that too much of Bioruby classes try to do too much. > How do we stay away from cluttering the code? How do we decide that > callers should not use HTML and handle security concerns? It is difficult to avoid HTML-like strings that we want treated as plain unformatted text but that some programs unexpectedly treat as HTML, as in the example above. For security, I'd like to ask security experts. Anyone in this list? I think escaping should be done by the formatting layer and should be turned on by default, because:
* Only the output formatting layer knows how the input data is processed.
* In many cases, the data comes from outside, and we can not expect it to be safe enough.
* Different escaping rules are needed for different output types, e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
Escaping by output methods seems natural, and helps to switch output formats without concerning escaping issues specific to each output format. > You write: > > > a.add_seq('ATCCATGG', '') > > If a programmer wants that - it is his concern in my opion. If he is > concerned about exploits he should not allow it. The Alignment class > does not care either. It is none of its business. The example is extreme case. For security, please ask experts. Apart from the security, I wish ">", "<", "&", etc. can be displayed correctly. I think methods to build HTML format should concern this. > BTW I fixed a number of PAML::Codeml bugs on this branch. So you > can ignore the existing PAML branch. Let's continue with the color > coding, assuming you can live with the PAML::Codeml implementation, > as it stands. When do you want the Bio::PAML::Codeml code to be merged to the blessed bioruby repository? Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From tomoakin at kenroku.kanazawa-u.ac.jp Wed Jan 13 01:57:11 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Wed, 13 Jan 2010 15:57:11 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> Hi, Happy New Year! > For security, I'd like to ask security experts. > Anyone in this list? Though I am not an expert, in a Japanese blog, http://takagi-hiromitsu.jp/diary/20051227.html Hiromitsu Takagi writes the reason why escaping should be default at the output point, from a security points, which sounds me reasonable, though I do not know an english literature. In addition, > * Different escaping rules are needed for different output types, > e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts. 
> Escaping by output methods seems natural, and helps to switch > output formats without concerning escaping issues specific > to each output format. this is a good argument. If html tag containing title is necessary, a non-default API that does accept html marked text rather than the normal text should be considered. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan
From pjotr.public14 at thebird.nl Wed Jan 13 02:37:06 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 13 Jan 2010 08:37:06 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> Message-ID: <20100113073706.GA25611@thebird.nl> Hi all, OK, I'll adapt the output generator to escape symbols.
And I think you are right that it belongs in the generator. There are really three scenarios:

1. Output that never contains symbols (sequence)
2. Output that can contain symbols, but should be escaped (descriptions, ids)
3. Output that can contain HTML

In my case I have all three.

I think with a sequence we can assume the content is a legal string. Escaping is overkill and (if needed) points to a bigger problem. I think we should not clutter the code with (1) - or degrade performance by default.

Case (2): yes!

Case (3), like a title or some text to plug in: we should escape by default, but add a parameter :html_escape => false for the cases where the user wants to plug in HTML.

OK?

Pj.

From tomoakin at kenroku.kanazawa-u.ac.jp Wed Jan 13 04:44:01 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 18:44:01 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113073706.GA25611@thebird.nl>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl>
Message-ID: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>

Hi,

> I think with a sequence we can assume the content is a legal string.
> Escaping is overkill and (if needed) points to a bigger problem. I
> think we should not clutter the code with (1) - or degrade performance
> by default.

If we are talking about Bio::Html::HtmlAlignment, it is better to escape even sequences and match lines, to make the class more independent of the implementation of the alignment class. Note that sim4 uses >>>...>>> in its match line, and a future intron-aware amino acid alignment program might use special characters to indicate introns.

If performance is really a problem, and it is in Bio::Alignment::Output, and the constructor guarantees that there are no special characters, then the escape may be skipped.
Escaping everything is the default simple program structure and removing that process is a kind of optimization with some programming effort to guarantee its validity without escaping. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan
From pjotr.public14 at thebird.nl Fri Jan 15 09:00:59 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 15 Jan 2010 15:00:59 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> Message-ID: <20100115140059.GA24948@thebird.nl> On second thought, escaping is less obvious than I thought. I can escape all generated HTML, but that leaves no way to customize the output.
Say I want to include an href in a sequence descriptor - which is a fairly typical requirement - that would be disabled. Likewise if someone wants to customize the title or footer - or even the information on the match_line. The problem here is that we are defining use - forcing the generated HTML into a straight jacket by adding business logic. Are we really telling our users not to use HTML in sequence descriptors, even if it is tied to one type of output? I don't like it. I am going to add a 'master' switch for escaping of HTML. The default will be with escaping. Pj.
From ngoto at gen-info.osaka-u.ac.jp Fri Jan 15 12:19:12 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Sat, 16 Jan 2010 02:19:12 +0900 Subject: [BioRuby] SPTR problem In-Reply-To: References: Message-ID: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 12 Jan 2010 22:52:42 +1000 Ben Woodcroft wrote: > Hi, > > While parsing all the yeast UniProt txt files I came across a problem with > the gn parser - it was returning an array when I expected a hash. Looking at > the code the problem seems to be this when statement: > > when /Name=/,/ORFNames=/ > @data['GN'] = gn_uniprot_parser > else > @data['GN'] = gn_old_parser > end > > http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line: > > GN OrderedLocusNames=YMR084W; > > So GN line had OrderedLocusNames= but not Name= or ORFNames=, so it didn't > use the new parser, like the other entries I came across. Should all 4 > possibilities be tested for in the when statement: (Synonyms= being the > 4th)? It seems to be a bug.
Perhaps there were no (or very few) entries which only had OrderedLocusNames= when the code was first written in 2005. The commit ID in git was b5c3342437ed698f215a87ea72c6cabf0575709d.

The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
|
| GN Name=; Synonyms=[, ...]; OrderedLocusNames=[, ...];
| GN ORFNames=[, ...];
|
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.

You are right that all 4 possibilities should be considered. "Synonyms" could be left out, but it may be safer to include it.

> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?

I don't know. Now, I think using the same names as described in the original entries may be preferable, too.

> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).

Please use a profiler to find bottlenecks.
% ruby -rprofile xxx.rb

> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.

What kind of "concatenated" do you mean? For simple concatenation, for example the original file distributed from the UniProt FTP site, Bio::FlatFile can be used.

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(please gunzip before reading!)

ff = Bio::FlatFile.open("uniprot_sprot.dat")
ff.each do |e|
  puts e.entry_id
end

>
> Thanks,
> ben

Thank you.
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Sat Jan 16 00:36:02 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Sat, 16 Jan 2010 14:36:02 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100115140059.GA24948@thebird.nl>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl>
Message-ID: <4B515042.7020204@kenroku.kanazawa-u.ac.jp>

Hi,

Pjotr Prins wrote:
> On second thought, escaping is less obvious than I thought. I can
> escape all generated HTML, but that leaves no way to customize the
> output. Say I want to include an href in a sequence descriptor - which
> is a fairly typical requirement - that would be disabled.

I agree with this. Having a link to the original sequence on the name is usually a good idea.

> I am going to add a 'master' switch for escaping of HTML. The default
> will be with escaping.

What do you think about testing whether the object responds to to_html, then calling to_html if it does and passing it to escapeHTML otherwise? The object may internally hold plain text and htmlized text, or plain text plus link information, or just the plain text, and take care of how it is output as an HTML inline element. If properly implemented, it can generate a link from "gi|112233|..." within a text and cache the converted result. The object can also simply pass through the user-supplied HTML.

I think it is a predictable use case that user-supplied sequences are aligned with sequences obtained from databases. Isn't it better to be able to treat user-supplied text as simple text while the sequences from databases carry proper links? This may not be simple with a master switch.
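
[Editorial sketch, not part of the original message.] The dispatch suggested above (call to_html when the object provides it, otherwise escape) could look roughly like the following. This is only a minimal sketch under stated assumptions: html_for and HTMLString are hypothetical names used for illustration and are not part of BioRuby; only CGI.escapeHTML from Ruby's standard library is an existing call.

```ruby
require 'cgi'

# Hypothetical marker class: a String that the programmer vouches for
# as ready-made HTML. Not part of BioRuby.
class HTMLString < String
  def to_html
    self
  end
end

# Hypothetical helper: objects that know how to render themselves as HTML
# are trusted; everything else is treated as plain text and escaped.
def html_for(obj)
  if obj.respond_to?(:to_html)
    obj.to_html
  else
    CGI.escapeHTML(obj.to_s)
  end
end

plain  = 'a<3_b>5'                                          # escaped
linked = HTMLString.new('<a href="http://example.org/">seq1</a>')  # passed through

puts html_for(plain)
puts html_for(linked)
```

With such a helper, a plain String name is always escaped, while a programmer who wants a link in a sequence name has to opt in explicitly by wrapping it. Whether the check belongs in the formatter at all is exactly the design question discussed in this thread.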
From pjotr.public14 at thebird.nl Sat Jan 16 03:30:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 16 Jan 2010 09:30:41 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <4B515042.7020204@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> Message-ID: <20100116083041.GA2663@thebird.nl> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote: > > I am going to add a 'master' switch for escaping of HTML. The default > > will be with escaping. > > How do you think to test if the object responds to to_html > and then call to_html else pass to escapeHTML. In this case the object to convert to HTML is a String and part of Bio::Alignment. Later implementations of Bio::Alignment could use a Bio::Sequence.id (or something Naohisa wrote me). It would mean we would have to create a Bio::Sequence::Descriptor object, which would contain several specialistic 'output' generators. This is a recurrent idea we need to discuss. I think *all* HTML based stuff should be in its own objects - and its own tree (I have created bio/output/html for that purpose). I think it is a bad idea to clutter regular BioRuby code with HTML specific stuff. Likewise for other outputs, as you pointed out, like plotting. Output should live in bio/lib/output/html bio/lib/output/plot bio/lib/output/gtk bio/lib/output/rails (perhaps) (etc) that way display code never pollutes the simple Bio::Sequence object, for example. You'll get Bio::Html::Sequence for that - or my preferred naming Bio::HtmlSequence. 
Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the latter would not care - and we could adapt the HtmlSequence info to show embedded hrefs. That would be the proper way to handle it. No testing of methods (like to_html), but use the object structure to define what is supported (and not). Until we implement that (get Bio::Alignment to support arbitrary Sequence objects) I think the master switch is fine. I have updated my branch. Default behaviour is escaping. If a user (like me) wants it otherwise, it is allowed. Pj. From tomoakin at kenroku.kanazawa-u.ac.jp Sun Jan 17 00:12:35 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Sun, 17 Jan 2010 14:12:35 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100116083041.GA2663@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> Message-ID: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> Hi, On 2010/01/16, at 17:30, Pjotr Prins wrote: > On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote: >>> I am going to add a 'master' switch for escaping of HTML. The >>> default >>> will be with escaping. >> >> How do you think to test if the object responds to to_html >> and then call to_html else pass to escapeHTML. > > In this case the object to convert to HTML is a String and part of > Bio::Alignment. Later implementations of Bio::Alignment could use a > Bio::Sequence.id (or something Naohisa wrote me). It would mean we > would have to create a Bio::Sequence::Descriptor object, which would > contain several specialistic 'output' generators. 
For the meanwhile I don't expect that sophisticated mechanism to automatically generate proper HTML, but simply add a mean to distinguish what should be escaped as a normal course and what is specifically prepared as html by the user. A user can write: class HTMLString < String def to_html self end end a = Bio::Alignment.new a.add_seq('ATCCATGG', HTMLString.new('a')) # this is html under the responsibility of the programmer a.add_seq('ATGCATGC', '') # this is not html; don't care on '<', or '>' simple = Bio::Html::HtmlAlignment.new(a, :title => HTMLString.new('A fancy HTML title')) html = simple.html() If Bio::Alignment does not force the object given to be String, such code should be possible without the change in Bio::Alignment, and only the HtmlAlignment class and the programmer needs to know it. So, HTML specific code does not need go to regular BioRuby code. > That would be the proper way to handle it. No testing of methods > (like to_html), but use the object structure to define what is > supported (and not). I'm not sure what do you mean by "use the object structure". How do you distinguish a plain text and HTML text? -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan
From pjotr.public14 at thebird.nl Sun Jan 17 08:54:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 17 Jan 2010 14:54:41 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> Message-ID: <20100117135441.GA24341@thebird.nl> Hi Tomoaki, Thanks for you responses. I really appreciate it.
On Sun, Jan 17, 2010 at 02:12:35PM +0900, Tomoaki NISHIYAMA wrote: > A user can write: > > class HTMLString < String > def to_html > self > end > end > > a = Bio::Alignment.new > a.add_seq('ATCCATGG', HTMLString.new('a')) There is at least one 'problem' with this approach. This assumes that Bio::Alignment will keep its current implementation. Currently Bio::Alignment stores a list of descriptions, and a list of sequences. As Naohisa wrote me two weeks ago, this is before Bio::Sequence had its own identifier/descriptor. If we redesign Bio::Alignment there is a large chance we will store Bio::Sequence instead of two lists (I, for one, would certainly favour that). The other problem is more about OOP. In your example you say once it is an HTML object (HTMLString) and next you add a specific method for html 'to_html'. Twice it is 'told' that it generates HTML. 'to_html' also implies something of a transformation. We should opt for a different method name (generate_html, perhaps, or html) class HTMLString def html end end The 'responsibility' of the output is with HTMLString. Good. This way an implementation of Bio::Alignment does not need to know about HTML, but still can generate the output, at the user's request. > # this is html under the responsibility of the programmer > > a.add_seq('ATGCATGC', '') > # this is not html; don't care on '<', or '>' > > simple = Bio::Html::HtmlAlignment.new(a, > :title => HTMLString.new('A fancy HTML title')) > html = simple.html() > > If Bio::Alignment does not force the object given to be String, > such code should be possible without the change in Bio::Alignment, > and only the HtmlAlignment class and the programmer needs to know it. > So, HTML specific code does not need go to regular BioRuby code. HTMLAlignment should not care either how the HTML is generated.. It is really up to the container holding the sequence, or description, what the output is. 
What I don't like about proposed approach is that HTMLAlignment gets an object, needs to check for an 'to_html or html' method (ugly), and if it does not exist, needs to escape the information (by calling the to_s method?). That is a lot of formal checking I need to do for every output generated. >> That would be the proper way to handle it. No testing of methods >> (like to_html), but use the object structure to define what is >> supported (and not). > > I'm not sure what do you mean by "use the object structure". > How do you distinguish a plain text and HTML text? The output is generated by an HTML aware container. We can agree to use one method 'html' method. Create different types of objects: HTMLSequence.html - generates formatted HTML ColorHTMLSequence.html - generates formatted color HTML EscapedHTMLSequence.html - generated escaped native stuff And if someone wanted it, he could create: Sequence.html - generates plain text This would prevent downstream 'checking' of object responsibilities. We can assume the user knows he is going to use HTMLAlignment and therefore we can expect him to pass in a known HTML supported Sequence object. The reason to get the responsibility in the right place is to create as clean as possible code. You really don't want downstream checking of methods. We can further discuss in Japan. At least it is clear we have several options. Pj. > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > On 2010/01/16, at 17:30, Pjotr Prins wrote: > >> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote: >>>> I am going to add a 'master' switch for escaping of HTML. The >>>> default >>>> will be with escaping. >>> >>> How do you think to test if the object responds to to_html >>> and then call to_html else pass to escapeHTML. >> >> In this case the object to convert to HTML is a String and part of >> Bio::Alignment. 
Later implementations of Bio::Alignment could use a >> Bio::Sequence.id (or something Naohisa wrote me). It would mean we >> would have to create a Bio::Sequence::Descriptor object, which would >> contain several specialistic 'output' generators. >> >> This is a recurrent idea we need to discuss. >> >> I think *all* HTML based stuff should be in its own objects - and its >> own tree (I have created bio/output/html for that purpose). >> >> I think it is a bad idea to clutter regular BioRuby code with HTML >> specific stuff. Likewise for other outputs, as you pointed out, like >> plotting. Output should live in >> >> bio/lib/output/html >> bio/lib/output/plot >> bio/lib/output/gtk >> bio/lib/output/rails (perhaps) >> (etc) >> >> that way display code never pollutes the simple Bio::Sequence object, >> for example. You'll get Bio::Html::Sequence for that - or my >> preferred naming Bio::HtmlSequence. >> >> Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the >> latter would not care - and we could adapt the HtmlSequence info to >> show embedded hrefs. >> >> That would be the proper way to handle it. No testing of methods >> (like to_html), but use the object structure to define what is >> supported (and not). >> >> Until we implement that (get Bio::Alignment to support arbitrary >> Sequence objects) I think the master switch is fine. I have updated >> my branch. Default behaviour is escaping. If a user (like me) wants >> it otherwise, it is allowed. >> >> Pj. 
>> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From donttrustben at gmail.com Mon Jan 18 21:15:30 2010 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 19 Jan 2010 12:15:30 +1000 Subject: [BioRuby] SPTR problem In-Reply-To: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> References: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi, Thanks for the response. embedded. 2010/1/16 Naohisa GOTO > > It seems to be a bug. Perhaps there were no (or very few) entries > which only had OrderedLocusNames= when the code was first written > in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d. > I was figuring that. Also, since no actual exception was thrown, errors might not have been noticed. I wrote a patch for this that I've been using internally, but haven't included unit tests. http://github.com/wwood/bioruby/commit/b2f6cb0b Happy to write tests, but you seem to rewrite my patches anyway.. > > The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. > The document http://www.uniprot.org/docs/sp_news.htm says: > | The new format of the GN line is: > | > | GN Name=; Synonyms=[, ...]; > OrderedLocusNames=[, ...]; > | GN ORFNames=[, ...]; > | > | None of the above four tokens are mandatory. But a "Synonyms" token can > only be present if there is a "Name" token. > > You are right the 4 possibilities should be considered. > "Synonyms" can be eliminated, but it may be safe to be included. > > > Also, while I'm here: > > * why does the returned hash have different keys than are in the file? > e.g. > > ORFNames becomes :orfs? > > I don't know. Now, I think using the same names as described > in the original entries may be preferred, too. > What do you suggest we do about this? 
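One possible answer to the key-naming question is to keep the token names exactly as they appear in the GN line. The following is only a minimal sketch of that idea, not the code in Bio::SPTR: parse_gn and GN_TOKENS are hypothetical names, the input strings are illustrative, and the "GN" line prefixes are assumed to have been stripped already. All four tokens are treated as optional, so an entry with only OrderedLocusNames= still parses.

```ruby
# Hypothetical helper, not part of BioRuby: parse the tokens of one gene
# from UniProtKB GN line content into a Hash keyed by the original token names.
GN_TOKENS = %w[Name Synonyms OrderedLocusNames ORFNames].freeze

def parse_gn(text)
  result = {}
  text.split(';').each do |field|
    key, value = field.strip.split('=', 2)
    # Skip empty fields and anything that is not one of the four known tokens.
    next unless GN_TOKENS.include?(key) && value
    names = value.split(',').map { |s| s.strip }
    # "Name" holds a single value; the other tokens hold lists.
    result[key] = (key == 'Name') ? names.first : names
  end
  result
end

# Illustrative inputs: one with all-but-one token, one with a single token.
p parse_gn('Name=Jon99Cii; Synonyms=SER1, SER5, Ser99Da; ORFNames=CG7877;')
p parse_gn('OrderedLocusNames=b0001, JW4367;')
```

With this layout the returned keys match the file ("ORFNames" rather than :orfs), at the cost of breaking callers that rely on the current symbol keys.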
> > > * I also found the parsing process for whole genomes quite slow (multiple > > hours for well annotated ones). > > Please use profiler to find bottlenecks. > % ruby -rprofile xxx.rb > I tried to do something like that but in the end found it easier to pre-grep the uniprot file, keeping only the lines relevant to me. There was too many levels of indirection in my code for me to bother tracking it down. > > > * is there any standard way to handle concatenated UniProt files? I wrote > my > > own as it was simple. > > What type of "concatenated" do you mean? > For simple concatenation, for example, original file distributed > from UniProt FTP site, Bio::FlatFile can be used. > > ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz > (please gunzip before reading!) > > ff = Bio::FlatFile.open("uniprot_sprot.dat") > ff.each do |e| > puts e.entry_id > end > More evidence I'm an idiot. Like I needed any. Thanks, ben From pjotr.public14 at thebird.nl Tue Jan 19 05:50:56 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 19 Jan 2010 11:50:56 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100117135441.GA24341@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> Message-ID: <20100119105056.GA29525@thebird.nl> Based on Tomoaki's comments I propose the following: The requirements are: A- input objects that know about HTML should generate that B- other input files get escapeHTML(object.to_s) For a container/displayer to recognize object A, object A should 
have a method to_html:

  class ObjectA
    def to_html
    end
  end

If to_html does not exist to_s is called - and escaped. The principle
will go into a mixin for the container class.

Everyone OK with this?

Pj.

From ktym at hgc.jp Tue Jan 19 07:41:31 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 19 Jan 2010 21:41:31 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119105056.GA29525@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
 <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
 <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
 <20100113073706.GA25611@thebird.nl>
 <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
 <20100115140059.GA24948@thebird.nl>
 <4B515042.7020204@kenroku.kanazawa-u.ac.jp>
 <20100116083041.GA2663@thebird.nl>
 <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
 <20100117135441.GA24341@thebird.nl>
 <20100119105056.GA29525@thebird.nl>
Message-ID: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>

Dear Pj and all,

I'm sorry that I could not spare enough time to follow this thread
but I'd like to add some comments.

Firstly, I don't like to use the method name 'to_html' as we already
deprecated to use 'to_fasta' because 'to_' is reserved for conversion
of the class in Ruby's convention (above two methods just convert
String to String).

We (Nakao-san and me) are now working to improve our TogoWS service
(http://togows.dbcls.jp) by supporting RDF output.
I hope to propose a generalized way to achieve this
(hopefully, before the BioHackathon 2010 http://hackathon3.dbcls.jp/).

Our current attempt is to have an 'output' method in the Bio::DB class
and each sub-class implements actual 'output_*' methods relevant to
appropriate formats.

# This kind of requirements may also be true for classes other than
# the Bio::DB (for example, Bio::Sequence, Alignment, Newick classes),
# so we may put this interface in the top level class (Bio::Root?),
# which does not exist for now, though.
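The idea of sharing one 'output' interface across unrelated classes could also be tried as a mixin rather than a common superclass. The following is only a rough sketch under that assumption: OutputDispatch and ExampleEntry are hypothetical names that do not exist in BioRuby, and the ArgumentError for an unknown format is an added design choice, not part of the proposal.

```ruby
# Hypothetical sketch; OutputDispatch and ExampleEntry are illustrative
# names, not BioRuby classes.
module OutputDispatch
  # Dispatch to output_<format> if the including class defines it.
  def output(format)
    method_name = "output_#{format.to_s.downcase}"
    unless respond_to?(method_name)
      raise ArgumentError, "unsupported format: #{format}"
    end
    send(method_name)
  end
end

class ExampleEntry
  include OutputDispatch

  def initialize(entry_id, definition, sequence)
    @entry_id = entry_id
    @definition = definition
    @sequence = sequence
  end

  # Output the sequence of the entry in FASTA format.
  def output_fasta
    ">#{@entry_id} #{@definition}\n#{@sequence}\n"
  end
end

entry = ExampleEntry.new('ENTRY_ID', 'DEFINITION', 'atgcatgc')
puts entry.output(:fasta)

# A format without a matching output_* method is reported explicitly.
begin
  entry.output(:rdf)
rescue ArgumentError => e
  puts e.message
end
```

Any class that includes the module only has to define the output_* methods it actually supports.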
In TogoWS, we internally use the BioRuby library, and the URI http://togows.dbcls.jp/entry/exampledb/1/definition is sent to the 'definition' method defined in the Bio::ExampleDB class. Similarly, we can map '.' notation in the following URLs to call output method using their suffix as a format specifier. http://togows.dbcls.jp/entry/exampledb/1.rdf http://togows.dbcls.jp/entry/exampledb/1.fasta Therefore, these can be mapped to output(:rdf) and output(:fasta) method calls to the Bio::ExampleDB class, respectively. All we need to do is to add these methods in every database class comprehensively. I think this is simple enough and beautiful. I'll attach a primitive pseudo code in below. Comments are welcome. Regards, Toshiaki Katayama module Bio class DB def output(format) send("output_#{format.to_s.downcase}") end end end module Bio class ExampleDB < DB # output sequence of the entry in FASTA format def output_fasta ">#{@entry_id} #{@definition}\n#{@sequence}\n" end # output contents of the entry in RDF (N3) format def output_rdf prefix_subject = "http://togows.dbcls.jp/entry/exampledb" prefix_predicate = "http://togows.dbcls.jp/ontology/exampledb" "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#definition>\t#{@definition} .\n" + "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#sequence>\t#{@sequence} .\n" end # output contents of the entry in HTML format def output_html "

#{@entry_id}

... blah, blah, blah ..." end end end entry = Bio::ExampleDB.new(str) entry.output(:fasta) # => # >ENTRY_ID # atgcatgcatgcatgcatgc entry.output(:rdf) # => # "DEFINITION" . # "atgcatgcatgcatgc" . On 2010/01/19, at 19:50, Pjotr Prins wrote: > Based on Tomoaki's comments I propose the following: > > The requirements are: > > A- input objects that know about HTML should generate that > B- other input files get escapeHTML(object.to_s) > > For a container/displayer to recognize object A, object A should have > a method to_html: > > class ObjectA > def to_html > end > end > > If to_html does not exist to_s is called - and escaped. The principle > will go into a mixin for the container class. > > Everyone OK with this? > > Pj. > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From tomoakin at kenroku.kanazawa-u.ac.jp Tue Jan 19 09:05:17 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Tue, 19 Jan 2010 23:05:17 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> Message-ID: Hi, > Firstly, I don't like to use the method name 'to_html' as we already > deprecated to use 'to_fasta' because 'to_' is reserved for conversion > of the class in Ruby's convention (above two methods just convert > 
String to String). I think HTML and String should be actually a different class. There are to_i and to_f for conversion between subclasses of Numeric, yet this isn't denied because the conversion is Numeric to Numeric. a string " aaa" in HTML is "<a href=example.com> aaa</a>" but HTML " aaa" in HTML is " aaa" The return value of to_html should be a different class than String. So, the point is > def output_html > "

#{@entry_id}

... blah, blah, blah ..." > end how to regulate the different behavior of @entry_id. If the nature of entry_id is plain text, that should be escaped. On the other hand sometimes the user may want to use html aware object for whatever purpose (color, link, etc...). When we want to mix them with data supplied from outside, say user input into CGI, those data shall usually be treated as plain text and suppress any interference with html. #!/usr/local/bin/ruby require 'bio' require 'cgi' class Bio::HTMLString < String def to_html self end end def Bio::generate_html(object) if object.respond_to?(:to_html) object.to_html else string = CGI.escapeHTML(object.to_s) #fall back to escaping Bio::HTMLString.new(string) end end p Bio::generate_html(12) p Bio::generate_html(Bio::HTMLString.new(' aaa')) p Bio::generate_html(' aaa') -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From pjotr.public14 at thebird.nl Tue Jan 19 09:34:22 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 19 Jan 2010 15:34:22 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> Message-ID: <20100119143422.GA1781@thebird.nl> On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote: > All we need to do is to add these methods in every database class > comprehensively. > > I think this is simple enough and beautiful. > I'll attach a primitive pseudo code in below. > Comments are welcome. 
I agree with Tomoaki it is too restrictive. What, indeed, if we want to present the HTML in a different way? The second comment is that I dislike the way the current files like sequence.rb and alignment.rb are mushrooming in size. There is much too much in there, which discourages people from diving in. I believe code should be readable, and easy to understand/digest. Sticking in output 'details', like HTML generation, does not help. I really would like all HTML to be in one sub-tree. Also XML, RDF and whatnot. When it is 'business' logic it should be in database. When it is output transformations it is not 'business' logic any longer. Don't you think the Sequence, or KEGG, object should not care about HTML? Or RDF, or plotting? Those are separate functionalities. They share common access patterns - which are part of the DB class. Finally, why not use method names? What is the added value of output(:html) over output_html Pj. From ktym at hgc.jp Tue Jan 19 10:33:30 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Wed, 20 Jan 2010 00:33:30 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> Message-ID: Nishiyama-san, I couldn't catch what you are trying to do... 
(maybe because I didn't read throughout the thread) On 2010/01/19, at 23:05, Tomoaki NISHIYAMA wrote: > Hi, > >> Firstly, I don't like to use the method name 'to_html' as we already >> deprecated to use 'to_fasta' because 'to_' is reserved for conversion >> of the class in Ruby's convention (above two methods just convert >> String to String). > > I think HTML and String should be actually a different class. > There are to_i and to_f for conversion between subclasses of Numeric, > yet this isn't denied because the conversion is Numeric to Numeric. > > a string " aaa" in HTML is > "<a href=example.com> aaa</a>" but > HTML " aaa" in HTML is " aaa" > > The return value of to_html should be a different class than String. If the method is named as to_html, it might return a HTML object. But, from my view point, a html string is still just a text and escaping the html string is responsibility of a programmer depending on where the string will be used. > > So, the point is >> def output_html >> "

#{@entry_id}

... blah, blah, blah ..." >> end > > how to regulate the different behavior of @entry_id. > If the nature of entry_id is plain text, that should be escaped. > On the other hand sometimes the user may want to use html aware > object for whatever purpose (color, link, etc...). > When we want to mix them with data supplied > from outside, say user input into CGI, those data shall usually > be treated as plain text and suppress any interference with html. I'm talking about a database class and the contents of @entry_id is a string parsed from an flat file entry of that database (not come from outside). > > #!/usr/local/bin/ruby > require 'bio' > require 'cgi' > > class Bio::HTMLString < String > def to_html > self > end > end > def Bio::generate_html(object) > if object.respond_to?(:to_html) > object.to_html > else > string = CGI.escapeHTML(object.to_s) #fall back to escaping > Bio::HTMLString.new(string) > end > end > > p Bio::generate_html(12) > p Bio::generate_html(Bio::HTMLString.new(' aaa')) > p Bio::generate_html(' aaa') Why we need to have this functionality under the Bio name space? 
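For reference, the fallback in the quoted code can be exercised with only the standard library, outside any Bio name space. This is a minimal sketch under that assumption: the top-level HTMLString class and generate_html method are hypothetical stand-ins for the quoted Bio::HTMLString and Bio::generate_html, and the link markup in the sample strings is illustrative.

```ruby
require 'cgi'

# Hypothetical marker class: a String whose content is already valid HTML.
class HTMLString < String
  def to_html
    self
  end
end

# Return HTML-aware objects unchanged; escape everything else.
def generate_html(object)
  if object.respond_to?(:to_html)
    object.to_html
  else
    # Fall back to escaping the plain-text representation.
    HTMLString.new(CGI.escapeHTML(object.to_s))
  end
end

link = '<a href="http://example.com">aaa</a>'

puts generate_html(12)                    # an Integer goes through to_s
puts generate_html(HTMLString.new(link))  # already HTML: passed through as-is
puts generate_html(link)                  # plain String: markup is escaped
```

The same text therefore produces different output depending on whether it arrives as a plain String or as an HTML-aware object, which is the distinction the thread is arguing about.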
Toshiaki > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > From ktym at hgc.jp Tue Jan 19 11:21:54 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Wed, 20 Jan 2010 01:21:54 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100119143422.GA1781@thebird.nl> References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> Message-ID: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> Dear Pj, On 2010/01/19, at 23:34, Pjotr Prins wrote: > On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote: >> All we need to do is to add these methods in every database class >> comprehensively. >> >> I think this is simple enough and beautiful. >> I'll attach a primitive pseudo code in below. >> Comments are welcome. > > I agree with Tomoaki it is too restrictive. What, indeed, if we want > to present the HTML in a different way? Hmm. Could you provide me some use cases? Override the output_html method, or, use some template engine to be more generic. > > The second comment is that I dislike the way the current files like > sequence.rb and alignment.rb are mushrooming in size. There is much > too much in there, which discourages people from diving in. I believe > code should be readable, and easy to understand/digest. I can agree some files became too large to learn and/or maintain. But if we try to change the structure of current code base, we need to define a clean criteria beforehand. 
If we separate files into sub files, people then need to look around the number of files, and it may also slow down the loading speed of the bioruby library. It is a problem of balance. In both cases, lack of excellent guide to read through the bioruby library might be a essential issue. > > Sticking in output 'details', like HTML generation, does not help. > > I really would like all HTML to be in one sub-tree. Also XML, RDF and > whatnot. When it is 'business' logic it should be in database. When it > is output transformations it is not 'business' logic any longer. I'm not sure about HTML but FASTA and RDF, for example, are tightly related to the original database format/contents. So, I proposed to have methods to generate formatted string in each database class. There can be many ways to design OO class trees and to find the best way to represent/abstract things is always a difficult task. At some time, we may do refactoring to produce BioRuby 2.0. Before doing that, we can discuss how to sit all classes/codes cleanly. We may need someone who understand entire structure/contents of the current codebase and willing to design a better one with a good sense. > > Don't you think the Sequence, or KEGG, object should not care about > HTML? Or RDF, or plotting? Those are separate functionalities. They > share common access patterns - which are part of the DB class. Again, we can take both approach. My current proposal is conservative one. Just add these functionalities in each class as the class knows what is in it and what is the best way to represent the contents. If we separate formatting/plotting functionalities into separate class, which might be something like Bio::FlatFile class who knows the header line format of every database entries. Or we may design better one. Anyway, I'm now listening. So, please don't stick with HTML things only and think a global design to which we can plan to migrate. > > Finally, why not use method names? 
What is the added value of > > output(:html) > > over > > output_html > > Pj. Maybe from esthetics viewpoint? I think it looks better, and, we can easily switch the output format depending on the context without modifying the code. Something like a @media property in CSS (screen, print etc.) in mind. if used_for_semantic_web? format = :rdf # add some codes to do preparation job for SW elsif used_for_blast? format = :fasta # add some codes to do preparation job for blast end # we don't need to change the following line in any context entry.output(format) Toshiaki From pjotr.public14 at thebird.nl Tue Jan 19 15:52:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 19 Jan 2010 21:52:41 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> Message-ID: <20100119205241.GA7043@thebird.nl> Dear Toshiaki, On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: > > I agree with Tomoaki it is too restrictive. What, indeed, if we want > > to present the HTML in a different way? > > Hmm. Could you provide me some use cases? Think of URL's. One user wants to point a gene ID to NCBI. Another to Swissprot. The container can not be aware of all exceptions - and really should not handle it. > Override the output_html method, or, use some template engine to be > more generic. Maybe those are good mechanisms. In the pre-hackathon we should discuss these points. > I can agree some files became too large to learn and/or maintain. 
> But if we try to change the structure of current code base, > we need to define a clean criteria beforehand. Yes. > If we separate files into sub files, people then need to look around > the number of files, and it may also slow down the loading speed of > the bioruby library. It is a problem of balance. > > In both cases, lack of excellent guide to read through the bioruby > library might be a essential issue. I think if we structure the files and modules well - and make them small enough - they become self-explaining. That would be my ultimate goal. > At some time, we may do refactoring to produce BioRuby 2.0. > Before doing that, we can discuss how to sit all classes/codes cleanly. > We may need someone who understand entire structure/contents of > the current codebase and willing to design a better one with a good sense. Yes. I agree it is a big step. But we should go for this type of challenge. > > Don't you think the Sequence, or KEGG, object should not care about > > HTML? Or RDF, or plotting? Those are separate functionalities. They > > share common access patterns - which are part of the DB class. > > Again, we can take both approach. My current proposal is conservative one. > Just add these functionalities in each class as the class knows what is in it > and what is the best way to represent the contents. > > If we separate formatting/plotting functionalities into separate class, > which might be something like Bio::FlatFile class who knows the header > line format of every database entries. Or we may design better one. FlatFile has some downsides. It has complicated the libraries. Complication means the modules are less easy to adapt/modify. I think it is slightly over-engineered. Maybe not enough of a problem to take it out, but I hope you see where I am coming from. > Anyway, I'm now listening. So, please don't stick with HTML things only > and think a global design to which we can plan to migrate. I have to spend a day on a writeup. 
In the coming two weeks. I will try to explain my ideas. > Maybe from esthetics viewpoint? > > I think it looks better, and, we can easily switch the output format > depending on the context without modifying the code. > Something like a @media property in CSS (screen, print etc.) in mind. > > if used_for_semantic_web? > format = :rdf > # add some codes to do preparation job for SW > elsif used_for_blast? > format = :fasta > # add some codes to do preparation job for blast > end > > # we don't need to change the following line in any context > entry.output(format) I see your point. The criticism is that it obfuscates the real intention of the code - i.e. it is not self documenting any longer. But, I guess, this boils down to preferences and acquired tastes. It is not obvious to a newbie, though it may be obvious for someone who is accustomed to Bioruby internals. Which may be good - depending on our basic values. Pj. From ktym at hgc.jp Tue Jan 19 19:49:37 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Wed, 20 Jan 2010 09:49:37 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100119205241.GA7043@thebird.nl> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> <20100119205241.GA7043@thebird.nl> Message-ID: Dear Pj, On 2010/01/20, at 5:52, Pjotr Prins wrote: > Dear Toshiaki, > > On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: >>> I agree with Tomoaki it is too restrictive. What, indeed, if we want >>> to present the HTML in a different way? >> >> Hmm. Could you provide me some use cases? > > Think of URL's. 
One user wants to point a gene ID to NCBI. Another > to Swissprot. The container can not be aware of all exceptions - and > really should not handle it. Still not clear to me. I supposed to generate a URL string for the href attribute of . However, is there any IDs which needs to be escaped? Or do you mean to embed a HTML snippet in URL? If so, we may need to use URL encoding (URI.escape) instead of the HTML escaping (CGI.escapeHTML). > >> Override the output_html method, or, use some template engine to be >> more generic. > > Maybe those are good mechanisms. In the pre-hackathon we should > discuss these points. Is there any better replacement for Ruby's CGI library available? Requirements: - separation of the HTML from CGI CGI.escapeHTML looks ugly in terms of the naming convention (CamelCase) and the name space -- why not HTML.escape(string). Moreover, we don't want to require 'cgi' just for escaping a HTML string. - support for templates (separation of logic and presentation) I had used erb and html-template. Sometimes erb is too slow (especially when it contains a nested loop to generate a number of lists or tables). - bundled with Ruby as a standard library Otherwise, we'd better to use Rails as a default environment (from a viewpoint of popularity). > >> I can agree some files became too large to learn and/or maintain. >> But if we try to change the structure of current code base, >> we need to define a clean criteria beforehand. > > Yes. > >> If we separate files into sub files, people then need to look around >> the number of files, and it may also slow down the loading speed of >> the bioruby library. It is a problem of balance. >> >> In both cases, lack of excellent guide to read through the bioruby >> library might be a essential issue. > > I think if we structure the files and modules well - and make them > small enough - they become self-explaining. That would be my ultimate > goal. > >> At some time, we may do refactoring to produce BioRuby 2.0. 
>> Before doing that, we can discuss how to sit all classes/codes cleanly. >> We may need someone who understand entire structure/contents of >> the current codebase and willing to design a better one with a good sense. > > Yes. I agree it is a big step. But we should go for this type of > challenge. > >>> Don't you think the Sequence, or KEGG, object should not care about >>> HTML? Or RDF, or plotting? Those are separate functionalities. They >>> share common access patterns - which are part of the DB class. >> >> Again, we can take both approach. My current proposal is conservative one. >> Just add these functionalities in each class as the class knows what is in it >> and what is the best way to represent the contents. >> >> If we separate formatting/plotting functionalities into separate class, >> which might be something like Bio::FlatFile class who knows the header >> line format of every database entries. Or we may design better one. > > FlatFile has some downsides. It has complicated the libraries. > Complication means the modules are less easy to adapt/modify. I think > it is slightly over-engineered. Maybe not enough of a problem to take > it out, but I hope you see where I am coming from. > >> Anyway, I'm now listening. So, please don't stick with HTML things only >> and think a global design to which we can plan to migrate. > > I have to spend a day on a writeup. In the coming two weeks. I will > try to explain my ideas. OK, let's discuss about these topics as well, during the pre-hackathon meeting (7th Feb) in Tokyo with other core developers. > >> Maybe from esthetics viewpoint? >> >> I think it looks better, and, we can easily switch the output format >> depending on the context without modifying the code. >> Something like a @media property in CSS (screen, print etc.) in mind. >> >> if used_for_semantic_web? >> format = :rdf >> # add some codes to do preparation job for SW >> elsif used_for_blast? 
>> format = :fasta >> # add some codes to do preparation job for blast >> end >> >> # we don't need to change the following line in any context >> entry.output(format) > > I see your point. The criticism is that it obfuscates the real > intention of the code - i.e. it is not self documenting any longer. > But, I guess, this boils down to preferences and acquired tastes. It > is not obvious to a newbie, though it may be obvious for someone who > is accustomed to Bioruby internals. Which may be good - depending on > our basic values. > > Pj. Note that, you can still directly use the output_html method in each database class. The output(format) method is prepared just as an abstract interface, which will be useful in the above situation, for example. Therefore, following both cases should return the same result and you can choose the coding style depending on the situation. # case 1 format = :rdf entry.output(format) # case 2 entry.output_rdf You can also check entry.respond_to?(:output_rdf) in both cases. Toshiaki From pjotr.public14 at thebird.nl Wed Jan 20 02:36:44 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 20 Jan 2010 08:36:44 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> Message-ID: <20100120073644.GA11295@thebird.nl> Dear Toshiaki, On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: > > I really would like all HTML to be in one sub-tree. Also XML, RDF and > > whatnot. When it is 'business' logic it should be in database. 
When it > > is output transformations it is not 'business' logic any longer. > > I'm not sure about HTML but FASTA and RDF, for example, are tightly > related to the original database format/contents. So, I proposed > to have methods to generate formatted string in each database class. > > There can be many ways to design OO class trees and to find the best > way to represent/abstract things is always a difficult task. I wrote a nice alignment HTML output generator. Which also displays PAML output. Currently it is in bio/output/html/htmlalignment.rb and the class is named Bio::Html::Alignment. For the current Bioruby, where do you want to put that? I don't feel it should be cluttering alignment.rb. Naohisa has suggested bio/alignment/output/html/alignment.rb instead. I feel uncomfortable with this. But it is kinda consistent with above, tightly relating it to the alignment object. What do you think of the class name? The code is in my color-alignment branch, see http://github.com/pjotrp/bioruby/tree/color-alignment Is anyone else interested in this type of discussion? We can take it off-list. Pj. From missy at be.to Wed Jan 20 04:17:50 2010 From: missy at be.to (MISHIMA, Hiroyuki) Date: Wed, 20 Jan 2010 18:17:50 +0900 Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat) Message-ID: <4B56CA3E.8000905@be.to> Hi all, I am using BioRuby 1.4.0., and have a trouble in handling the FASTA.QUAL format using Bio::FastaNumericFormat. 
Please see the following code: ======================== require 'rubygems' require 'bio' FASTA_QUAL =<<'EOS' >SAMPLE1 30 30 29 42 EOS qual = Bio::FastaNumericFormat.new(FASTA_QUAL) bs = qual.to_biosequence puts bs.output(:raw) ========================= The last line raise an error: ========================= (eval):2:in `__get__seq': undefined method `seq' for # (NoMethodError) from (eval):4:in `seq' from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in `output' from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in `output' from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in `output' from fasta_numeric_format.rb:11 ========================= In the last line, using :fasta, :fasta_numeric etc. make same results. Please let me know if you have ideas to solve this problem. Hiro. -- MISHIMA, Hiroyuki, DDS, Ph.D. COE Research Fellow Department of Human Genetics Nagasaki University Graduate School of Biomedical Sciences From andrew.j.grimm at gmail.com Wed Jan 20 07:09:19 2010 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Wed, 20 Jan 2010 23:09:19 +1100 Subject: [BioRuby] Thread-safety of alignment Message-ID: Is alignment intended to be thread-safe in bioruby? If so, should I use the same alignment factory between threads, or a separate one in each thread? Andrew From ngoto at gen-info.osaka-u.ac.jp Wed Jan 20 08:36:29 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 20 Jan 2010 22:36:29 +0900 Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat) In-Reply-To: <4B56CA3E.8000905@be.to> References: <4B56CA3E.8000905@be.to> Message-ID: <20100120133630.052BF1CBC433@idnmail.gen-info.osaka-u.ac.jp> Hi, This is a bug, and will be fixed. Indeed, Bio::FastaNumericFormat does not contain sequence, and I forgot to take care about calling to_biosequence. 
As a workaround:

  qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
  bs = Bio::Sequence.new('')
  bs.quality_scores = qual.data
  puts bs.output(:fasta_numeric)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 20 Jan 2010 18:17:50 +0900
"MISHIMA, Hiroyuki" wrote:

> Hi all,
>
> I am using BioRuby 1.4.0., and have a trouble in handling the FASTA.QUAL
> format using Bio::FastaNumericFormat.
>
> Please see the following code:
> ========================
> require 'rubygems'
> require 'bio'
>
> FASTA_QUAL =<<'EOS'
> >SAMPLE1
> 30 30 29 42
> EOS
>
> qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
> bs = qual.to_biosequence
> puts bs.output(:raw)
> =========================
>
> The last line raise an error:
>
> =========================
> (eval):2:in `__get__seq': undefined method `seq' for
> # (NoMethodError)
> from (eval):4:in `seq'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in
> `output'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in
> `output'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in
> `output'
> from fasta_numeric_format.rb:11
> =========================
>
> In the last line, using :fasta, :fasta_numeric etc. make same results.
>
> Please let me know if you have ideas to solve this problem.
>
> Hiro.
> --
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Jan 20 08:50:45 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:50:45 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To:
References:
Message-ID: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Wed, 20 Jan 2010 23:09:19 +1100
Andrew Grimm wrote:

> Is alignment intended to be thread-safe in bioruby? If so, should I
> use the same alignment factory between threads, or a separate one in
> each thread?

It has not been confirmed to be thread-safe, so it is safer to use a separate one in each thread.

Currently, BioRuby is not designed for manipulating the same object from different threads. When that is needed, using a mutex is recommended.

Library developers are encouraged to write thread-safe code where possible, but it is not mandatory.
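
To illustrate the mutex recommendation above, here is a minimal sketch using only Ruby's standard Mutex and Thread classes. The `shared` array is a hypothetical stand-in for whatever object the threads have in common; it is not a BioRuby class, and nothing here says which BioRuby objects actually need the lock.

```ruby
require 'thread'   # required on Ruby 1.8; harmless on later versions

# Hypothetical stand-in for an object shared between threads.
shared = []
lock = Mutex.new

threads = (1..4).map do |i|
  Thread.new(i) do |n|
    100.times do |j|
      # Only one thread at a time may touch the shared object.
      lock.synchronize { shared << [n, j] }
    end
  end
end

# Wait for all worker threads to finish.
threads.each { |t| t.join }

puts shared.size
```

The alternative suggested above, one separate object per thread, avoids the lock entirely at the cost of creating more objects.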
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Andrew > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ktym at hgc.jp Thu Jan 21 09:05:42 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Thu, 21 Jan 2010 23:05:42 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100120073644.GA11295@thebird.nl> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> <20100120073644.GA11295@thebird.nl> Message-ID: <7B739736-1D0D-43E2-89E8-8F6B4DCC3404@hgc.jp> Dear Pj, I looked your code and had a feeling that we should use some template system. If HTML tags are hard coded in the library as you did, it will be very hard to modify them by the user. Besides, what version of the HTML specification did you have in mind? This is my first time to see the

tag is used in the form of

. Is it valid?

I also think decorations should be separated into the CSS layer, and you should avoid using the font tag, especially when you are trying to distribute your code as part of the library.

As for the file location, I still like the way Naohisa has suggested, although I'm not sure the internal node 'output/html' is necessary for 'bio/alignment/output/html/alignment.rb'.

Anyway, we need to try every approach to learn the pros and cons. With your proposal, we may have a tree like this:

--------------------------------------------------
for bio/alignment.rb and bio/db/kegg/compound.rb and bio/db/genbank.rb ...

bio/output/html/html_alignment.rb (Bio::Html::Alignment)
bio/output/html/html_kegg_compound.rb (Bio::Html::KEGG::COMPOUND)
bio/output/html/html_genbank.rb (Bio::Html::GenBank)
  :
bio/output/rdf/rdf_kegg_compound.rb (Bio::RDF::KEGG::COMPOUND)
bio/output/rdf/rdf_genbank.rb (Bio::RDF::GenBank)
  :
bio/output/fasta/fasta_genbank.rb (Bio::FASTA::GenBank)
bio/output/fasta/fasta_kegg_genes.rb (Bio::FASTA::KEGG::GENES)
  :
bio/output/gff/gff_genbank.rb (Bio::GFF::GenBank)
  :
--------------------------------------------------

Apparently, the class names for the output formats conflict with existing classes (e.g. Bio::FASTA, Bio::GFF), and we would need to look into each subdirectory to find which output formats are supported for a particular database.

If we instead gather the templates of the output formats along with the database classes, the tree looks like this:

--------------------------------------------------
for bio/alignment.rb:
bio/alignment/alignment.html.erb
  :
for bio/db/kegg/compound.rb:
bio/db/kegg/compound/compound.rdf.erb
bio/db/kegg/compound/compound.tut.erb
bio/db/kegg/compound/compound.html.erb
  :
for bio/db/genbank.rb:
bio/db/genbank/genbank.rdf.erb
bio/db/genbank/genbank.gff.erb
bio/db/genbank/genbank.html.erb
bio/db/genbank/genbank.fasta.erb
  :
--------------------------------------------------

However, this is still a desk plan and we need to try more (we already started for RDF).
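
As a rough illustration of the template idea above, here is a sketch using ERB from Ruby's standard library. The `EntrySketch` class, its fields, and the inline `TEMPLATES` hash are assumptions made up for this example (in the plan above the templates would live in files such as genbank.fasta.erb); this is not existing BioRuby code.

```ruby
require 'erb'

# Hypothetical templates, one per output format. In the layout sketched
# above these strings would be read from *.erb files next to the class.
TEMPLATES = {
  :fasta => "><%= entry_id %> <%= definition %>\n<%= seq %>\n",
  :html  => "<p><b><%= ERB::Util.html_escape(entry_id) %></b> " \
            "<%= ERB::Util.html_escape(definition) %></p>\n"
}

# Hypothetical stand-in for a database entry class.
class EntrySketch
  attr_reader :entry_id, :definition, :seq

  def initialize(entry_id, definition, seq)
    @entry_id = entry_id
    @definition = definition
    @seq = seq
  end

  # Render the entry with the template registered for the given format.
  # The template is evaluated in this object's binding, so it can call
  # entry_id, definition and seq directly.
  def output(format)
    ERB.new(TEMPLATES.fetch(format)).result(binding)
  end
end

entry = EntrySketch.new('X00001', 'test entry', 'atgc')
puts entry.output(:fasta)
puts entry.output(:html)
```

With this arrangement a user could change the markup by editing a template rather than the library code, which is the point raised about hard-coded HTML tags.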
Toshiaki On 2010/01/20, at 16:36, Pjotr Prins wrote: > Dear Toshiaki, > > On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: >>> I really would like all HTML to be in one sub-tree. Also XML, RDF and >>> whatnot. When it is 'business' logic it should be in database. When it >>> is output transformations it is not 'business' logic any longer. >> >> I'm not sure about HTML but FASTA and RDF, for example, are tightly >> related to the original database format/contents. So, I proposed >> to have methods to generate formatted string in each database class. >> >> There can be many ways to design OO class trees and to find the best >> way to represent/abstract things is always a difficult task. > > I wrote a nice alignment HTML output generator. Which also displays PAML > output. Currently it is in bio/output/html/htmlalignment.rb and the > class is named Bio::Html::Alignment. > > For the current Bioruby, where do you want to put that? I don't feel > it should be cluttering alignment.rb. Naohisa has suggested > bio/alignment/output/html/alignment.rb instead. I feel uncomfortable > with this. But it is kinda consistent with above, tightly relating it > to the alignment object. > > What do you think of the class name? > > The code is in my color-alignment branch, see > > http://github.com/pjotrp/bioruby/tree/color-alignment > > Is anyone else interested in this type of discussion? We can take it > off-list. > > Pj. From pjotr.public14 at thebird.nl Thu Jan 21 11:20:49 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 21 Jan 2010 17:20:49 +0100 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) Message-ID: <20100121162049.GB31462@thebird.nl> Dear Toshiaki, On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote: > I looked your code and had a feeling that we should use some > template system. If HTML tags are hard coded in the library as you > did, it will be very hard to modify them by the user. 
Aren't we trying to overcomplicate things? This is an HTML generator - in fact it is embedded HTML as I don't provide the , header or body parts. It can just be inserted into Rails, or whatever HTML framework that is out there. Templating is just another abstraction. I don't intend to template engines like Rails. Or, are you here merely referring to using the CGI class (or something like that). I guess I could do that, though I have trouble seeing the benefits. It is just another way of writing HTML statements. > Besides, what version of the HTML specification did you have in > mind? > This is my first time to see the

tag is used in the form of

. Is it valid? Yes. It is, in fact, XHTML. > I also think decorations should be separated to the CSS layer and you should avoid to use the tag, especially when you are trying to distribute your code as a part of the library. We use hard coded colors. I could use CSS, but then you need to provide a CSS file (or I need to hard code the header of the file). That makes it (again) more complicated than necessary. Where do we store the CSS file, how do we make sure the browser finds it? CSS is really to adapt look and feel. If the output is meant to be fixed, why make it flexible? Besides all (future) browsers support the font tag, as used. If that stops we could always adapt that source code. > As for the file location, I still like the way Naohisa has > suggested. Alright. I can move the files, if that was all. However, my colored alignment is not going to make it into Bioruby this way. There is always something wrong with my code, it appears. Now I need to move file locations that have not really been decided on; I need to template HTML - but we haven't decided how and it is questionable; I need to use CSS, though I think it makes things worse for users. Are we really sure you want to reject this code just because it does not live up to everyone's current and future expectations? It may still be useful to someone else, you know, it does not break anything else, and can be improved in the future. Once we decide what we want to achieve. The same really holds to my PAML branch and my GEO branch. Both contain useful utilities for others to use. And now the alignment is the third pending Bioruby branch. Can you imagine my growing frustration? Should this go into Bioruby, or should I start another project, like others have done? Or stick it into my existing biotools or bigbio projects? Just, so I don't have the hassle? The way the Perl people handle it is by having independent modules. Everyone owns his, or her, own module and Perl's CPAN acts more as an aggragator. 
The advantage is that the environment is more dynamic. And you really don't care what is inside a module. That is up to the maintainer and his/her users. We could create independent BioRuby modules, which have their own git repositories. When a module is nice enough to include in Bioruby make it a git submodule - I use this technique for biolib - it will register in the BioRuby repository. That way Bioruby still controls what goes in a release. However, modules can be maintained for experimental setups or private use. So my modules would go in lib/bio/modules/paml lib/bio/modules/geo lib/bio/modules/htmlalignment each its own git repository. When one of those is 'strong' enough for main line you move it into a different location in the main repository. Modules could even be included in Bioruby releases. What hurts me now is that no one is going to use my code, since I don't have the time to make it perfect, and it is hidden in my experimental Bioruby branches. We should find a way to make 'experimental code' available to the rest of the community. That way we may also 'recruit' help to make the code more perfect. Make it easy to allow external modules to become visible through Bioruby - that is a win-win, as well as a more bazaar-like approach to OSS development. I wonder how many people on this list would contribute code if it was more loosely organised. Pj. From ktym at hgc.jp Thu Jan 21 12:54:24 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Fri, 22 Jan 2010 02:54:24 +0900 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: <20100121162049.GB31462@thebird.nl> References: <20100121162049.GB31462@thebird.nl> Message-ID: Dear Pj, I can understand your frustration and I like your idea of the 'module' system, as it reminds me the way how the Linux kernel tree is successfully maintained. > I wonder how many people on this list would contribute code if it was > more loosely organised. Indeed. 
However, I think our move from cvs to git was already a great step that it opened large opportunity to all those who want to participate in development. Before doing that, "open source" project not always mean "open to join" project. Now, everyone can easily fork the project and release their modified codes as you already done. So, we may able to evaluate from the current situation that how many other people have tried. Anyway, it is still a difficult problem that who will decide and how to decide when to migrate the contributed code into the main tree. It might sound like a excuse, but I'm also suffering from the difficulty. I also have several modules which are not yet contributed to the main tree. For example, my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/sge/) because I'm not sure it is general enough and where it fits. As for the HTML portion, I see your point. * I'd like to hear comments from others. * How people like to render/visualize the BioRuby objects (especially in HTML)? * I didn't mean to use the CGI class for HTML generation (I even don't like that). * The use of

seems invalid in XHTML. See http://www.w3.org/TR/xhtml1/#C_3 P.S. Once, I had developed a mechanism to integrate end-user code snippets in the BioRuby shell, called plugins. I wrote some plugins which render a colored codon table, a formatted summary of sequence properties etc. If those and functions defined in your plugins can be easily accessed by puts Bio.your_function_name(options) or something like that, is it satisfy your needs? If so, we can consider to make a repository for such plugins and bundle them in the BioRuby as well. Regards, Toshiaki Katayama On 2010/01/22, at 1:20, Pjotr Prins wrote: > Dear Toshiaki, > > On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote: >> I looked your code and had a feeling that we should use some >> template system. If HTML tags are hard coded in the library as you >> did, it will be very hard to modify them by the user. > > Aren't we trying to overcomplicate things? This is an HTML generator > - in fact it is embedded HTML as I don't provide the , header or > body parts. It can just be inserted into Rails, or whatever HTML > framework that is out there. > > Templating is just another abstraction. I don't intend to template > engines like Rails. > > Or, are you here merely referring to using the CGI class (or something > like that). I guess I could do that, though I have trouble seeing the > benefits. It is just another way of writing HTML statements. > >> Besides, what version of the HTML specification did you have in >> mind? >> This is my first time to see the

tag is used in the form of

. Is it valid? > > Yes. It is, in fact, XHTML. > >> I also think decorations should be separated to the CSS layer and you should avoid to use the tag, especially when you are trying to distribute your code as a part of the library. > > We use hard coded colors. I could use CSS, but then you need to > provide a CSS file (or I need to hard code the header of the file). > That makes it (again) more complicated than necessary. Where do we > store the CSS file, how do we make sure the browser finds it? CSS is > really to adapt look and feel. If the output is meant to be fixed, why > make it flexible? Besides all (future) browsers support the font tag, > as used. If that stops we could always adapt that source code. > >> As for the file location, I still like the way Naohisa has >> suggested. > > Alright. I can move the files, if that was all. > > However, my colored alignment is not going to make it into Bioruby > this way. There is always something wrong with my code, it appears. > Now I need to move file locations that have not really been decided > on; I need to template HTML - but we haven't decided how and it is > questionable; I need to use CSS, though I think it makes things worse > for users. > > Are we really sure you want to reject this code just because it does > not live up to everyone's current and future expectations? It may > still be useful to someone else, you know, it does not break anything > else, and can be improved in the future. Once we decide what we want > to achieve. > > The same really holds to my PAML branch and my GEO branch. Both > contain useful utilities for others to use. And now the alignment is > the third pending Bioruby branch. > > Can you imagine my growing frustration? Should this go into Bioruby, > or should I start another project, like others have done? Or stick it > into my existing biotools or bigbio projects? Just, so I don't have > the hassle? > > The way the Perl people handle it is by having independent modules. 
> Everyone owns his, or her, own module and Perl's CPAN acts more as an > aggragator. The advantage is that the environment is more dynamic. And > you really don't care what is inside a module. That is up to the > maintainer and his/her users. > > We could create independent BioRuby modules, which have their own git > repositories. When a module is nice enough to include in Bioruby make > it a git submodule - I use this technique for biolib - it will > register in the BioRuby repository. That way Bioruby still controls > what goes in a release. However, modules can be maintained for > experimental setups or private use. So my modules would go in > > lib/bio/modules/paml > lib/bio/modules/geo > lib/bio/modules/htmlalignment > > each its own git repository. > > When one of those is 'strong' enough for main line you move it into a > different location in the main repository. Modules could even be > included in Bioruby releases. > > What hurts me now is that no one is going to use my code, since I > don't have the time to make it perfect, and it is hidden in my > experimental Bioruby branches. We should find a way to make > 'experimental code' available to the rest of the community. That way > we may also 'recruit' help to make the code more perfect. > > Make it easy to allow external modules to become visible through > Bioruby - that is a win-win, as well as a more bazaar-like approach > to OSS development. > > I wonder how many people on this list would contribute code if it was > more loosely organised. > > Pj. From yannick.wurm at unil.ch Thu Jan 21 13:21:40 2010 From: yannick.wurm at unil.ch (Yannick Wurm) Date: Thu, 21 Jan 2010 19:21:40 +0100 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: References: Message-ID: On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote: > re we really sure you want to reject this code just because it does > not live up to everyone's current and future expectations? 
It may
> still be useful to someone else, you know, it does not break anything
> else, and can be improved in the future. Once we decide what we want
> to achieve.
>
> What hurts me now is that no one is going to use my code, since I
> don't have the time to make it perfect, and it is hidden in my
> experimental Bioruby branches. We should find a way to make
> 'experimental code' available to the rest of the community. That way
> we may also 'recruit' help to make the code more perfect.

I agree 100% that enthusiastic bioruby improvements like Pjotr's should be encouraged & given maximal visibility. It's better to have great tools with room for improvement than no tools. (A year or two ago I needed colored html alignments and ended up with an ugly, ugly hack that used t_coffee to generate html output from the alignments I'd generated elsewhere - something like Pjotr's code would have been much more elegant.)

I also have the feeling that code contributions in general are given more negative than positive feedback on this list. I believe that is a grave mistake, because the bioruby community will not grow without passionate users & contributors and more quality code.

just my two cents,
yannick

--------------------------------------------
yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
http://www.unil.ch/dee/page28685_fr.html

From pjotr.public14 at thebird.nl Fri Jan 22 03:55:08 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 09:55:08 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To:
References: <20100121162049.GB31462@thebird.nl>
Message-ID: <20100122085508.GB12248@thebird.nl>

On Fri, Jan 22, 2010 at 02:54:24AM +0900, Toshiaki Katayama wrote:
> Dear Pj,
>
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.

Thinking about it, there are other good examples.
The R language supports modules in CRAN - similar in many ways to generic Perl CPAN and Ruby's gems. But, on top of CRAN they also have Bioconductor which aggregates Bio related modules. The main benefit is that it pre-packages all Bio related packages and people can load it on the fly. See http://www.bioconductor.org/ We don't want to replace gems - but I think the gem system is too loose for most people, and it requires every module to understand and comply with the gem system. I think Bioruby can play a role here. We can have modules (or plugins, like Rails has) that come either with Bioruby's installation, or get installed on request. If we find a syntax for that it would be great. E.g. Bio::Module.load(:html_alignment) If it is part of Bioruby, pass. Otherwise throw error: "Bio::Module :html_alignment not installed, try Bio::Module.install(:html_alignment)" Bio::Module.install(:html_alignment) will search the definition and install it. Depending on the module it can be installed as a gem, or fetched through git or a tarball (an optional parameter can overrule behaviour). On success one can start as either function will prepare for: html_aln = Bio::Html::Alignment.new('my.aln') The nice thing about this setup is that (1) It is really easy on the user (2) Decouples the module from Bioruby - all issues are between the users and the module maintainer - discussions can still be on the main mailing list (3) Retains some control on what modules are allowed in, an what not (4) Modules can be obsoleted (5) Modules can be updated outside Bioruby's mainline. e.g. Bio::Module.install(:html_alignment,:development=>true) Pj. 
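
The load/install behaviour proposed above could be sketched roughly as follows. Everything here is hypothetical: `BioModuleSketch` and its methods are made up for illustration and stand in for the proposed Bio::Module, which does not exist in BioRuby. The install step only records a setup block; actually fetching a gem, git repository or tarball is left out.

```ruby
# Hypothetical sketch of the proposed Bio::Module interface.
module BioModuleSketch
  class NotInstalledError < StandardError; end

  @registry = {}   # module name => options and setup block
  @loaded = []     # names of modules that have been loaded

  class << self
    # Record a module as installed. A real implementation would fetch
    # it as a gem, through git, or as a tarball, depending on options.
    def install(name, options = {}, &setup)
      @registry[name] = { :options => options, :setup => setup }
      name
    end

    # Load an installed module, or fail with a hint on how to install it.
    def load(name)
      entry = @registry[name]
      unless entry
        raise NotInstalledError,
              "Bio::Module #{name.inspect} not installed, " \
              "try Bio::Module.install(#{name.inspect})"
      end
      entry[:setup].call if entry[:setup]
      @loaded << name unless @loaded.include?(name)
      true
    end

    def loaded?(name)
      @loaded.include?(name)
    end
  end
end

# Usage: install first, then load.
BioModuleSketch.install(:html_alignment, :development => true)
BioModuleSketch.load(:html_alignment)
puts BioModuleSketch.loaded?(:html_alignment)
```

Loading a name that was never installed raises NotInstalledError with the message quoted in the proposal, so the user is told which call to make next.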
From tomoakin at kenroku.kanazawa-u.ac.jp Fri Jan 22 04:12:29 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Fri, 22 Jan 2010 18:12:29 +0900 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: References: <20100121162049.GB31462@thebird.nl> Message-ID: <066BB141-7217-4343-85B4-165072A58E06@kenroku.kanazawa-u.ac.jp> Hi, > As for the HTML portion, I see your point. > > * I'd like to hear comments from others. > * How people like to render/visualize the BioRuby objects > (especially in HTML)? > * I didn't mean to use the CGI class for HTML generation (I even > don't like that). Perhaps the way to render the objects depends on both objects and purposes, but if the object has a string representation, just showing them is perhaps a good default. Also defining the way how to represent in HTML or any other format for all classes comprehensively is too laborious as the first step and a way to allow gradual growth of the codebase seems good. It is the way flatfile parser grew to support many formats. Thus, mechanism to do class specific conversion and default conversion for non HTML aware classes is good. Criticism on 'cgi' library for the default conversion CGI.escapeHTML(object.to_s), especially for the name is understandable. There are already criticism on CGI.rb in itself but there are no *standard* alternatives yet. Perhaps we can just copy or rewrite the escapeHTML code and make it any name that fits our purpose. A drawback of having our escapeHTML code is that it could be redundant in many cases where html generation is for CGI, and we cannot get benefit from CGIAlt or any other compatible speedup library on CGI, rewrite or extension with C. But I think this is not a very large problem. Making require 'bio' automatically loading cgi.rb is undesirable. If the html code is not automatically loaded by require 'bio' but loaded only another call require 'bio/html', then I feel 'bio/html' loading cgi.rb is in a reasonable range. 
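
A cgi.rb-free default conversion along the lines described above might look like the sketch below. The name `escape_html` and the `HTML_ESCAPES` table are assumptions for illustration, not existing BioRuby methods; the function escapes the four characters &, <, >, and " in the object's string representation.

```ruby
# Hypothetical replacement for CGI.escapeHTML(object.to_s) that does
# not require loading cgi.rb.
HTML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;'
}

# Default HTML conversion for classes that are not HTML aware:
# take the string representation and escape the special characters.
def escape_html(object)
  object.to_s.gsub(/[&<>"]/) { |ch| HTML_ESCAPES[ch] }
end

puts escape_html('<a href="x">A & B</a>')
```

A class with its own HTML rendering would override this default, which allows the gradual growth mentioned above.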
Capability to use style instead of directly specifying color and font is desirable since it could reduce the output size, and possibly readability. Nontheless, this is not mandatory and the first implementation with direct specification is ok. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan On 2010/01/22, at 2:54, Toshiaki Katayama wrote: > Dear Pj, > > I can understand your frustration and I like your idea of the > 'module' system, as it reminds me the way how the Linux kernel > tree is successfully maintained. > >> I wonder how many people on this list would contribute code if it was >> more loosely organised. > > Indeed. > > However, I think our move from cvs to git was already a great step > that it opened large opportunity to all those who want to participate > in development. Before doing that, "open source" project not always > mean "open to join" project. > > Now, everyone can easily fork the project and release their modified > codes as you already done. So, we may able to evaluate from the > current > situation that how many other people have tried. > > Anyway, it is still a difficult problem that who will decide and > how to decide when to migrate the contributed code into the main tree. > It might sound like a excuse, but I'm also suffering from the > difficulty. > I also have several modules which are not yet contributed to the > main tree. > For example, my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/ > sge/) > because I'm not sure it is general enough and where it fits. > > > As for the HTML portion, I see your point. > > * I'd like to hear comments from others. > * How people like to render/visualize the BioRuby objects > (especially in HTML)? > * I didn't mean to use the CGI class for HTML generation (I even > don't like that). > * The use of

seems invalid in XHTML. See http://www.w3.org/TR/ > xhtml1/#C_3 > > > P.S. > Once, I had developed a mechanism to integrate end-user code snippets > in the BioRuby shell, called plugins. I wrote some plugins which > render > a colored codon table, a formatted summary of sequence properties etc. > > If those and functions defined in your plugins can be easily > accessed by > > puts Bio.your_function_name(options) > > or something like that, is it satisfy your needs? > > If so, we can consider to make a repository for such plugins and > bundle > them in the BioRuby as well. > > Regards, > Toshiaki Katayama > > > On 2010/01/22, at 1:20, Pjotr Prins wrote: > >> Dear Toshiaki, >> >> On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote: >>> I looked your code and had a feeling that we should use some >>> template system. If HTML tags are hard coded in the library as you >>> did, it will be very hard to modify them by the user. >> >> Aren't we trying to overcomplicate things? This is an HTML generator >> - in fact it is embedded HTML as I don't provide the , >> header or >> body parts. It can just be inserted into Rails, or whatever HTML >> framework that is out there. >> >> Templating is just another abstraction. I don't intend to template >> engines like Rails. >> >> Or, are you here merely referring to using the CGI class (or >> something >> like that). I guess I could do that, though I have trouble seeing >> the >> benefits. It is just another way of writing HTML statements. >> >>> Besides, what version of the HTML specification did you have in >>> mind? >>> This is my first time to see the

tag is used in the form of >>>

. Is it valid? >> >> Yes. It is, in fact, XHTML. >> >>> I also think decorations should be separated to the CSS layer and >>> you should avoid to use the tag, especially when you are >>> trying to distribute your code as a part of the library. >> >> We use hard coded colors. I could use CSS, but then you need to >> provide a CSS file (or I need to hard code the header of the file). >> That makes it (again) more complicated than necessary. Where do we >> store the CSS file, how do we make sure the browser finds it? CSS is >> really to adapt look and feel. If the output is meant to be fixed, >> why >> make it flexible? Besides all (future) browsers support the font >> tag, >> as used. If that stops we could always adapt that source code. >> >>> As for the file location, I still like the way Naohisa has >>> suggested. >> >> Alright. I can move the files, if that was all. >> >> However, my colored alignment is not going to make it into Bioruby >> this way. There is always something wrong with my code, it appears. >> Now I need to move file locations that have not really been decided >> on; I need to template HTML - but we haven't decided how and it is >> questionable; I need to use CSS, though I think it makes things worse >> for users. >> >> Are we really sure you want to reject this code just because it does >> not live up to everyone's current and future expectations? It may >> still be useful to someone else, you know, it does not break anything >> else, and can be improved in the future. Once we decide what we want >> to achieve. >> >> The same really holds to my PAML branch and my GEO branch. Both >> contain useful utilities for others to use. And now the alignment is >> the third pending Bioruby branch. >> >> Can you imagine my growing frustration? Should this go into Bioruby, >> or should I start another project, like others have done? Or stick it >> into my existing biotools or bigbio projects? Just, so I don't have >> the hassle? 
>>
>> The way the Perl people handle it is by having independent modules.
>> Everyone owns his, or her, own module and Perl's CPAN acts more as an
>> aggregator. The advantage is that the environment is more dynamic. And
>> you really don't care what is inside a module. That is up to the
>> maintainer and his/her users.
>>
>> We could create independent BioRuby modules, which have their own git
>> repositories. When a module is nice enough to include in Bioruby make
>> it a git submodule - I use this technique for biolib - it will
>> register in the BioRuby repository. That way Bioruby still controls
>> what goes in a release. However, modules can be maintained for
>> experimental setups or private use. So my modules would go in
>>
>>   lib/bio/modules/paml
>>   lib/bio/modules/geo
>>   lib/bio/modules/htmlalignment
>>
>> each its own git repository.
>>
>> When one of those is 'strong' enough for main line you move it into a
>> different location in the main repository. Modules could even be
>> included in Bioruby releases.
>>
>> What hurts me now is that no one is going to use my code, since I
>> don't have the time to make it perfect, and it is hidden in my
>> experimental Bioruby branches. We should find a way to make
>> 'experimental code' available to the rest of the community. That way
>> we may also 'recruit' help to make the code more perfect.
>>
>> Make it easy to allow external modules to become visible through
>> Bioruby - that is a win-win, as well as a more bazaar-like approach
>> to OSS development.
>>
>> I wonder how many people on this list would contribute code if it was
>> more loosely organised.
>>
>> Pj.
> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From jan.aerts at gmail.com Fri Jan 22 04:34:43 2010 From: jan.aerts at gmail.com (Jan Aerts) Date: Fri, 22 Jan 2010 09:34:43 +0000 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: References: Message-ID: <4c7507a71001220134j3eecf626y90755ddd919336e4@mail.gmail.com> Hear, hear... Exactly my feelings as well. j. 2010/1/21 Yannick Wurm > On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote: > > > re we really sure you want to reject this code just because it does > > not live up to everyone's current and future expectations? It may > > still be useful to someone else, you know, it does not break anything > > else, and can be improved in the future. Once we decide what we want > > to achieve. > > > > > What hurts me now is that no one is going to use my code, since I > > don't have the time to make it perfect, and it is hidden in my > > experimental Bioruby branches. We should find a way to make > > 'experimental code' available to the rest of the community. That way > > we may also 'recruit' help to make the code more perfect. > > > I agree 100% that enthusiastic bioruby improvements like Pjotr's should be > encouraged & given maximal visibility. > It's better to have great tools with room for improvement than no tools. > (a year or two ago I needed colored html alignments and ended up with an > ugly, ugly hack that used t_coffee to generate html output from the > alignments I'd generated elsewhere - something like Pjotr's code would have > been much more elegant) > > I also have the feeling that code contributions in general are given more > negative than positive feedback on this list. 
> I believe it's a grave mistake because the bioruby community will not
> grow without passionate users & contributors and more quality code.
>
> just my two cents,
>
> yannick
>
> --------------------------------------------
> yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
> http://www.unil.ch/dee/page28685_fr.html
>
>
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Jan 22 04:48:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:48:20 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100122085508.GB12248@thebird.nl>
References: <20100121162049.GB31462@thebird.nl> <20100122085508.GB12248@thebird.nl>
Message-ID: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>

Hi,

> Bio::Module.load(:html_alignment)

What is the benefit over
  require 'bio/html_alignment' # no autoload by require 'bio'
?

> Bio::Module.install(:html_alignment)
>
> will search the definition and install it.

I feel installation is easier from shell like:
  $ ruby bioruby-inst-module html_alignment
but calling the Module.install internally is fine.

> (5) Modules can be updated outside Bioruby's mainline. e.g.
> Bio::Module.install(:html_alignment,:development=>true)

We need to have a mechanism to check the versions between
the standard bioruby and the modules. Especially when the
mainline bioruby is updated. Different modules perhaps will
have different level of dependency on the bioruby code, and
update in the main bioruby code sometimes may break the old
module.
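
One possible shape for such a version check, as a rough sketch only: it
leans on RubyGems' Gem::Version and Gem::Requirement classes, and the
helper name compatible_with_bioruby? as well as the idea that a module
declares a requirement string are assumptions made for illustration,
not part of BioRuby or of the proposal.

```ruby
require 'rubygems'

# Hypothetical helper (not a BioRuby API): returns true when the given
# BioRuby version string satisfies the requirement string that an
# external module declares for itself.
def compatible_with_bioruby?(bioruby_version, requirement)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(bioruby_version))
end

# "~> 1.4" means ">= 1.4 and < 2.0" in RubyGems terms.
puts compatible_with_bioruby?('1.4.0', '~> 1.4')   # true
puts compatible_with_bioruby?('1.3.1', '>= 1.4.0') # false
```

In a real setup the first argument would presumably come from the
running library (the thread mentions Bio::BIORUBY_VERSION_ID), and a
module could refuse to load when the check fails.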
-- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From pjotr.public14 at thebird.nl Fri Jan 22 05:49:00 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 22 Jan 2010 11:49:00 +0100 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp> References: <20100121162049.GB31462@thebird.nl> <20100122085508.GB12248@thebird.nl> <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp> Message-ID: <20100122104900.GB15628@thebird.nl> On Fri, Jan 22, 2010 at 06:48:20PM +0900, Tomoaki NISHIYAMA wrote: >> Bio::Module.load(:html_alignment) > > What is the benefit over > require 'bio/html_alignment' # no autoload by require 'bio' > ? A method allows more checking. I presume the module information will be somewhere in a YAML file in the main tree. Or maintained through git submodules. >> Bio::Module.install(:html_alignment) >> >> will search the definition and install it. > > I feel installation is easier from shell like: > $ ruby bioruby-inst-module html_alignment > but calling the Module.install internally is fine. My example is for an interactive session. You only do it once (I hope). Or when an author says he has updated his module. >> (5) Modules can be updated outside Bioruby's mainline. e.g. >> Bio::Module.install(:html_alignment,:development=>true) > > We need to have a mechanism to check the versions between > the standard bioruby and the modules. Especially when the > mainline bioruby is updated. Different modules perhaps will > have different level of dependency on the bioruby code, and > update in the main bioruby code sometimes may break the old > module. Well. Bioruby should not care. I think you misunderstand the purpose. Modules are *not* to be supported from Bioruby. It is only a mechanism to make them easily available. If things break, they break. That is why it is developmental, or experimental. 
The modules that are well 'supported' will come inside the distribution. Outside modules are up to the module maintainer. Besides, you don't want to replace gems. If an author wants versioning he can provide a gem (which, again, can be loaded as a Bioruby module). Once a module goes main stream versioning is moot. It just becomes part of the Bioruby tree. When everyone understands this a module can still support versioning. But I think that ought to be done through gems. Pj. From andrew.j.grimm at gmail.com Tue Jan 26 07:12:35 2010 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Tue, 26 Jan 2010 23:12:35 +1100 Subject: [BioRuby] Thread-safety of alignment In-Reply-To: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi Naohisa Goto, I tried creating a new factory in each thread, but I sometimes (but not always) have errors. Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb correct? Does it cause problems for anyone else? Some of the errors I get include the ones seen at http://gist.github.com/286775 It's possible that the issues are caused by problems in tempfile itself (which may have been fixed in August 2009 according to the changelog). Thanks, Andrew On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO wrote: > Hi, > > On Wed, 20 Jan 2010 23:09:19 +1100 > Andrew Grimm wrote: > >> Is alignment intended to be thread-safe in bioruby? If so, should I >> use the same alignment factory between threads, or a separate one in >> each thread? > > It is not confirmed to be thread-safe, so it is safe to use > separate one in each thread. > > Currently, in BioRuby, manipulating the same object from different > threads is not intended. When manipulating the same object from > different threads is needed, using mutex is recommended. 
> > For library developers, it is encouraged to write thread-safe > code if possible, but not mandatory. > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > >> >> Andrew >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > From ngoto at gen-info.osaka-u.ac.jp Tue Jan 26 10:00:04 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 27 Jan 2010 00:00:04 +0900 Subject: [BioRuby] Thread-safety of alignment In-Reply-To: References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp> Hi Andrew, On Tue, 26 Jan 2010 23:12:35 +1100 Andrew Grimm wrote: > Hi Naohisa Goto, > > I tried creating a new factory in each thread, but I sometimes (but > not always) have errors. Please show ruby version and BioRuby version. % ruby -v % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID' (If you are using BioRuby 1.2.1 or earlier, % ruby -rbio -e 'p Bio::BIORUBY_VERSION' ) > Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb > correct? Does it cause problems for anyone else? The "rescue RuntimeError" in line 15 may hide problems. In my environment, it seems that the RuntimeError is raised in lib/bio/alignment.rb. The error message I observed without the rescue was "alignment result is inconsistent with input data", and output file created by Clustalw was unexpectedly empty. It might be a bug of Tempfile in Ruby, but not sure. With Ruby 1.8.7, errors are observed in some times. % ruby -v ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux] ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux] ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux] With Ruby 1.9.1-p378, no errors when I executed several times. 
% ruby -v ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux] > Some of the errors I get include the ones seen at http://gist.github.com/286775 The message "ERROR: Multiple sequences found with same name (found 0 at least twice)!" is reported by ClustalW, and it indicates incorrect input file sequence names. Maybe two file contents are unexpectedly concatenated or mixed possibly due to a bug of Tempfile, but not sure. > It's possible that the issues are caused by problems in tempfile > itself (which may have been fixed in August 2009 according to the > changelog). Another possibility is resource limits of the machine: the number of child processes, total memory size, etc. If exceeding limits, new child clustalw process could not be started, or running clustalw processes might be killed. This also causes void or truncated result files, and leads to ruby-level errors. Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Thanks, > > Andrew > > On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO > wrote: > > Hi, > > > > On Wed, 20 Jan 2010 23:09:19 +1100 > > Andrew Grimm wrote: > > > >> Is alignment intended to be thread-safe in bioruby? If so, should I > >> use the same alignment factory between threads, or a separate one in > >> each thread? > > > > It is not confirmed to be thread-safe, so it is safe to use > > separate one in each thread. > > > > Currently, in BioRuby, manipulating the same object from different > > threads is not intended. When manipulating the same object from > > different threads is needed, using mutex is recommended. > > > > For library developers, it is encouraged to write thread-safe > > code if possible, but not mandatory. 
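
The two options in the advice quoted above (a separate object in each
thread, or a mutex around a shared object) can be sketched with plain
Ruby threads. FakeFactory below is a hypothetical stand-in for an
alignment factory, not a BioRuby class, and the sketch says nothing
about the Tempfile or resource-limit problems discussed in this thread.

```ruby
require 'thread'

# Hypothetical stand-in for an alignment factory that keeps internal
# state and is therefore not safe to share between threads unguarded.
class FakeFactory
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # Pretend to align: upcase the sequences and count the call.
  def query(seqs)
    @calls += 1
    seqs.map { |s| s.upcase }
  end
end

# Option 1: a separate factory in each thread (nothing is shared).
per_thread = (1..5).map { |i|
  Thread.new { FakeFactory.new.query(["acgt#{i}"]) }
}.map { |t| t.value }

# Option 2: one shared factory, every call wrapped in a Mutex.
shared = FakeFactory.new
lock = Mutex.new
(1..5).map { |i|
  Thread.new { lock.synchronize { shared.query(["acgt#{i}"]) } }
}.each { |t| t.join }

puts per_thread.flatten.join(' ')
puts shared.calls
```

Option 1 avoids locking altogether; option 2 serializes the calls, so
it only helps when the shared object is genuinely needed.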
> >
> > Naohisa Goto
> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >
> >>
> >> Andrew
> >> _______________________________________________
> >> BioRuby Project - http://www.bioruby.org/
> >> BioRuby mailing list
> >> BioRuby at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> >

From andrew.j.grimm at gmail.com Tue Jan 26 22:07:18 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 27 Jan 2010 14:07:18 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi Naohisa Goto,

On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO wrote:
> Hi Andrew,
>
> On Tue, 26 Jan 2010 23:12:35 +1100
> Andrew Grimm wrote:
>
>> Hi Naohisa Goto,
>>
>> I tried creating a new factory in each thread, but I sometimes (but
>> not always) have errors.
>
> Please show ruby version and BioRuby version.
>  % ruby -v
>  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
> (If you are using BioRuby 1.2.1 or earlier,
>  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
> )
>

I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0.

>> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
>> correct? Does it cause problems for anyone else?
>
> The "rescue RuntimeError" in line 15 may hide problems.
> In my environment, it seems that the RuntimeError is raised
> in lib/bio/alignment.rb. The error message I observed
> without the rescue was
> "alignment result is inconsistent with input data",
> and output file created by Clustalw was unexpectedly empty.
> It might be a bug of Tempfile in Ruby, but not sure.
>
> With Ruby 1.8.7, errors are observed in some times.
>  % ruby -v
>  ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
>  ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
>  ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
>
> With Ruby 1.9.1-p378, no errors when I executed several times.
>  % ruby -v
>  ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
>

I suspect errors may occur on earlier versions of ruby 1.9.1.

>> Some of the errors I get include the ones seen at http://gist.github.com/286775
>
> The message "ERROR: Multiple sequences found with same name
> (found 0 at least twice)!" is reported by ClustalW, and
> it indicates incorrect input file sequence names. Maybe
> two file contents are unexpectedly concatenated or mixed
> possibly due to a bug of Tempfile, but not sure.
>
>> It's possible that the issues are caused by problems in tempfile
>> itself (which may have been fixed in August 2009 according to the
>> changelog).
>
> Another possibility is resource limits of the machine:
> the number of child processes, total memory size, etc.
> If exceeding limits, new child clustalw process could
> not be started, or running clustalw processes might be
> killed. This also causes void or truncated result files,
> and leads to ruby-level errors.
>

Thanks for that suggestion. I re-ran the test using only 5 threads in
the new gist http://gist.github.com/287499

> Thanks,
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Thanks,
>>
>> Andrew
>>
>> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
>> wrote:
>> > Hi,
>> >
>> > On Wed, 20 Jan 2010 23:09:19 +1100
>> > Andrew Grimm wrote:
>> >
>> >> Is alignment intended to be thread-safe in bioruby? If so, should I
>> >> use the same alignment factory between threads, or a separate one in
>> >> each thread?
>> >
>> > It is not confirmed to be thread-safe, so it is safe to use
>> > separate one in each thread.
>> >
>> > Currently, in BioRuby, manipulating the same object from different
>> > threads is not intended.
>> > When manipulating the same object from
>> > different threads is needed, using mutex is recommended.
>> >
>> > For library developers, it is encouraged to write thread-safe
>> > code if possible, but not mandatory.
>> >
>> > Naohisa Goto
>> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>> >
>> >>
>> >> Andrew
>> >> _______________________________________________
>> >> BioRuby Project - http://www.bioruby.org/
>> >> BioRuby mailing list
>> >> BioRuby at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioruby
>> >
>> >
>
>

From missy at be.to Fri Jan 29 01:46:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 15:46:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
Message-ID: <4B628437.30305@be.to>

Hi all,

How about implementing the following methods?

  Bio::FastaFormat#each_entry
  Bio::FastaNumericFormat#each_entry

The following is a sample code to generate a FASTQ string from a FASTA
string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.

I am afraid that simpler or easier ways are already existed in BioRuby...

Hiro.

-----
#!/usr/local/bin/ruby
require 'rubygems'
require 'bio'

module Bio
  class FastaFormat
    def each_entry
      return to_enum(:each_entry) unless block_given?
      @continue = self.dup
      loop do
        yield @continue
        overrun = @continue.entry_overrun
        break unless overrun
        @continue = Bio::FastaFormat.new(overrun)
      end
    end
  end

  class FastaNumericFormat
    def each_entry
      return to_enum(:each_entry) unless block_given?
      @continue = self.dup
      loop do
        yield @continue
        overrun = @continue.entry_overrun
        break unless overrun
        @continue = Bio::FastaNumericFormat.new(overrun)
      end
    end
  end
end

fasta = <<EOS
>FXQB1I00000001
TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
>FXQB1I00000002
CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
EOS

qual = <<EOS
>FXQB1I00000001
30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30
42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
>FXQB1I00000002
30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34
22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30
27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
EOS

enum_fasta = Bio::FastaFormat.new(fasta).each_entry
enum_qual = Bio::FastaNumericFormat.new(qual).each_entry

loop do
  fastq = Bio::Sequence.adapter(enum_fasta.next,
                                Bio::Sequence::Adapter::Fastq)
  fastq.quality_score_type = :phred
  fastq.quality_scores = enum_qual.next.data
  puts fastq.output(:fastq)
end

--
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From ngoto at gen-info.osaka-u.ac.jp Fri Jan 29 05:25:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 29 Jan 2010 19:25:29 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <4B628437.30305@be.to>
References: <4B628437.30305@be.to>
Message-ID: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" wrote:

> Hi all,
>
> How about implementing the following methods?
>
> Bio::FastaFormat#each_entry
> Bio::FastaNumericFormat#each_entry
>
> The following is a sample code to generate a FASTQ string from a FASTA
> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways are already existed in BioRuby...

I think mixing single entry parser with multiple entry iterator
will cause confusion, and not good way.

For most parser classes in bioruby, expected data source is
String containing single entry data. In addition, for IO with
possible multiple entries, Bio::FlatFile is the front-end that
can detect data type, splits each entry, and calling assigned
parser class.

For String containing multiple entries, using StringIO and
then Bio::FlatFile is the easiest way, although indirect.
Recently, many efficient memory-mapped data transfer methods
are available, e.g. memcached, IPC shared memory, mmap(2)
system call. I'm now thinking how to treat such data efficiently.

Below is an example using StringIO and Bio::FlatFile.

#------------------------------------------------
require 'stringio'
require 'bio'

# When copy-and paste this script, the "> " in the head of
# each line should be removed.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
>
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS

ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
ff_qual = Bio::FlatFile.open(StringIO.new(qual))

while entry_fasta = ff_fasta.next_entry
  seq = entry_fasta.to_biosequence
  seq.quality_score_type = :phred
  seq.quality_scores = ff_qual.next_entry.data
  puts seq.output(:fastq, :title => entry_fasta.definition)
end
#------------------------------------------------

> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
>
> loop do
>   fastq = Bio::Sequence.adapter(enum_fasta.next,
>                                 Bio::Sequence::Adapter::Fastq)
>   fastq.quality_score_type = :phred
>   fastq.quality_scores = enum_qual.next.data
>   puts fastq.output(:fastq)
> end

Bio::Sequence.adapter is bioruby library internal use only,
and normally should not be used by user scripts. In addition,
using Adapter::Fastq for Bio::FastaFormat data is mismatch.
In this case, use Bio::FastaFormat#to_biosequence.

>
> --
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From missy at be.to Fri Jan 29 06:24:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 20:24:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
References: <4B628437.30305@be.to> <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <4B62C55F.1050506@be.to>

Hi, Naohisa GOTO,

Thank you so much for detailed explanation and a sample code. It was
big help for me to understand BioRuby's overall design.

Although I used here-documents in my code, what I wanted to do was just
make a FASTQ file from regular FASTA and FASTA.QUAL files.

I tried your code using my relatively large input files. It was much
faster than my code. The final code is simply the following:

----
require 'bio'

ff_fasta = Bio::FlatFile.open(ARGV[0])
ff_qual = Bio::FlatFile.open(ARGV[0]+".qual")

while entry_fasta = ff_fasta.next_entry
  seq = entry_fasta.to_biosequence
  seq.quality_score_type = :phred
  seq.quality_scores = ff_qual.next_entry.data
  puts seq.output(:fastq, :title => entry_fasta.definition)
end
----

Hiro.

Naohisa GOTO wrote (2010/01/29 19:25):
> Hi,
>
> On Fri, 29 Jan 2010 15:46:15 +0900
> "MISHIMA, Hiroyuki" wrote:
>
>> Hi all,
>>
>> How about implementing the following methods?
>> >> Bio::FastaFormat#each_entry >> Bio::FastaNumericFormat#each_entry >> >> The following is a sample code to generate a FASTQ string from a FASTA >> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later. >> >> I am afraid that simpler or easier ways are already existed in BioRuby... > > I think mixing single entry parser with multiple entry iterator > will cause confusion, and not good way. > > For most parser classes in bioruby, expected data source is > String containing single entry data. In addition, for IO with > possible multiple entries, Bio::FlatFile is the front-end that > can detect data type, splits each entry, and calling assigned > parser class. > > For String containing multiple entries, using StringIO and > then Bio::FlatFile is the easiest way, although indirect. > Recently, many efficient memory-mapped data transfer methods > are available, e.g. memcached, IPC shared memory, mmap(2) > system call. I'm now thinking how to treat such data efficiently. -- MISHIMA, Hiroyuki, DDS, Ph.D. COE Research Fellow Department of Human Genetics Nagasaki University Graduate School of Biomedical Sciences From biopython at maubp.freeserve.co.uk Fri Jan 29 05:36:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Jan 2010 10:36:40 +0000 Subject: [BioRuby] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010 In-Reply-To: References: <20100128203505.GG40046@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com> Hi all, This is a great topic but should be continue it on just the one mailing list? Is there a suitable BOSC list, or how about the general Open Bio list? On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote: > > Brad, this sounds exciting! 
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties
to Biopython (plus it would be a shame if we'd be one man down for
trying to solve the challenges - grin). Having a project representative
"sign off" on the challenge might work - or simply the whole of the
BOSC committee which is quite balanced.

Alternatively some kind of panel of challenges does seem a good way
to reduce individual project bias (as suggested by Scooter), but there
will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing an
automated annotation as a GenBank, EMBL, or GFF file be too ambitious
for example? There are already several major projects to do this e.g.
RAST http://rast.nmpdr.org/ Peter (@Biopython) From ngoto at gen-info.osaka-u.ac.jp Mon Jan 4 07:15:18 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 4 Jan 2010 16:15:18 +0900 Subject: [BioRuby] Codeml parser In-Reply-To: <20091231141546.GA5770@thebird.nl> References: <20091231141546.GA5770@thebird.nl> Message-ID: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> Hi, I also think the current Bio::PAML::Codeml::Report is needed to be rewritten. It is great if you do so. Here is my comments. > codeml = Bio::PAML::Codeml.new(nil, :runmode => 0, :RateAncestor => 1, > :alpha => 0.5, :fix_alpha => 0) > report = codeml.query(alignment, tree) > > which, as it happens, works. The 'nil' points to the program executable. > 'nil' merely fills in 'codeml'. It would have been beter to make it one > of the listed options, e.g. :binary => 'codeml'. That would save the ugly > 'nil' parameter and belongs more to the principle of least surprise, that > makes Ruby shine. It is safe not to merge bioruby internal options and PAML's options. If the upstream authors of PAML introduced a new option named binary, severe problem would occur. One way is to write a code that acts something like C++ polymorphism. For example, the code below accepts the three cases. * Bio::PAML::Codeml.new("/path/to/codeml") * Bio::PAML::Codeml.new({ :xxx => yyy, :ppp => qqq }) * Bio::PAML::Codeml.new("/path/to/codeml", { :xxx => yyy, :ppp => qqq }) def initialize(*argv) program = nil params = {} case argv.size when 0, 1 begin params = argv[0].to_hash rescue NoMethodError program = argv[0] end when 2 program, params = *argv else raise ArgumentError, "wrong number of arguments (#{argv.size} for 2)" end # continues to the current code... The bad points are: * Complexity of code is increased. * It might make difficult to refactor codes, especially when keyword arguments are introduced in the future version of Ruby. 
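
The constructor idea described above can be tried out in isolation. The
following is only a sketch under stated assumptions: CodemlSketch is a
hypothetical stand-in class (not the real Bio::PAML::Codeml), the
default program name 'codeml' is assumed from the discussion, and the
single-argument case is decided with respond_to?(:to_hash) instead of
rescuing NoMethodError.

```ruby
# Hypothetical stand-in for Bio::PAML::Codeml; only the argument
# handling of the constructor is shown.
class CodemlSketch
  DEFAULT_PROGRAM = 'codeml' # assumed default executable name

  attr_reader :program, :params

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0
      # nothing given: keep the defaults
    when 1
      if argv[0].respond_to?(:to_hash)
        params = argv[0].to_hash   # new({ :xxx => yyy })
      else
        program = argv[0]          # new("/path/to/codeml")
      end
    when 2
      program, params = *argv      # new("/path/to/codeml", { ... })
    else
      raise ArgumentError, "wrong number of arguments (#{argv.size} for 0..2)"
    end
    @program = program || DEFAULT_PROGRAM
    @params = params
  end
end

a = CodemlSketch.new('/path/to/codeml')
b = CodemlSketch.new({ :runmode => 0, :alpha => 0.5 })
c = CodemlSketch.new('/path/to/codeml', { :runmode => 0 })
puts a.program
puts b.program
puts c.params[:runmode]
```

Keeping the program path and the PAML options in separate positional
arguments, as here, is what avoids a clash if PAML ever gains an option
with the same name as a BioRuby-internal key.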
Note that Ruby's author Matz has said that he had not applied the principle of least surprise to the design of Ruby. (http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy ) Please be careful that the word "principle of least surprise (POLS)" is NG word when you request something in Ruby. (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 ) > A new implementation of Bio::PAML::Codeml::Report > So I propose to rewrite the class supporting for multiple models, > with the following usage (starting from a codeml report - really result): > > >> report.models.size > => 2 > >> report.models[0].name > => "M0" I suppose report.models returns a Hash containing objects of newly written class (for example, Bio::PAML::Codeml::Report::Model) or Struct. It seems good. Existing methods could be changed to return the first model's values. > Unit tests Currently, tests with external dependencies (e.g. web services) are located in the test/functional/ directory. So, your tests running codeml would be named test/functional/bio/appl/paml/test_codeml.rb, test/functional/bio/appl/paml/codeml/test_report.rb, or something like this. > These tests, for example, can be run on a special switch: > > runner.rb --test-dependencies I'm now searching ways to pass such parameters to tests. Note that tests can also be run in various ways. For example, ruby test/unit/bio/appl/paml/codeml/test_report.rb testrb test/unit/bio/appl/paml/codeml rake test > I am sure it works, but doesn't anyone think this belongs in a support > module (e.g. BioTestFile) for testing? What I would like to see is > something less brittle: > > require 'bio/test' > str = BioTestFile::read('paml/codeml/output.txt') I'd like to keep tests simple and clear, and I think using standard File.read is enough and clearer. When using such special class, to know the behavior of the test code, reading extra file is needed. > Personally, I dislike the naming/name space scheme of Bioruby. 
> What to think of invoking a class named > > report = Bio::PAML::Codeml::Report.new Because there are many bioinformatics software and databases, names tends to be longer, and nesting of namespace tends to be deeper. I'd like to know naming rules and policies of other open-bio projects. > Why can't it just be > > include Bio > report = Codeml.new I think it is enough to write "include Bio::PAML" instead of (or in addition to) "include Bio". > include Bio > result = Paml.new(:program => 'codeml') I don't like introducing such new parameter like :program. I think 1 class 1 binary is better. In addition, because the differences within PAML tools (codeml, baseml, yn00, etc.) are currently not small, merging the classes is not so realistic now. On Thu, 31 Dec 2009 15:15:46 +0100 Pjotr Prins wrote: > Hi Michael, > > I have a writeup on improving the current PAML functionality. Are you > OK with this? > > http://bioruby.open-bio.org/wiki/BIORUBY_PAML > > (maybe it does not belong on the bioruby Wiki - but I think of it > like a 'design' document). > > Pj. > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Mon Jan 4 09:03:18 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 4 Jan 2010 10:03:18 +0100 Subject: [BioRuby] Bioruby design Message-ID: <20100104090318.GA16136@thebird.nl> Thanks for the reply Naohisa. As we are moving on to design, rather than one implementation I am changing the thread. On Mon, Jan 04, 2010 at 04:15:18PM +0900, Naohisa GOTO wrote: > It is safe not to merge bioruby internal options and PAML's options. > If the upstream authors of PAML introduced a new option named binary, > severe problem would occur. I am against breaking interfaces. 
This is a minor design problem which should be avoided in the future. And, yes, I would certainly not favour a polymorphism solution, unless unavoidable. I don't think it is worth 'fixing' this interface aspect at this stage. Perhaps, there will be opportunities later. > Note that Ruby's author Matz has said that he had not applied the > principle of least surprise to the design of Ruby. > (http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy ) > Please be careful that the word "principle of least surprise (POLS)" > is NG word when you request something in Ruby. > (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 ) I did not know that, and personally I do not care. I think POLS is a really good idea, though it should not automatically come at the expense of (for example) convenience, or performance. I favour easy API's, and that is where the principle of least surprise comes in. It means to me that I don't have to fetch the manuals every time (like I do with Perl). So, let's not throw away the baby with the bath water. I like POLS, as much as I like KISS. > > >> report.models[0].name > > => "M0" > > I suppose report.models returns a Hash containing objects of newly written > class (for example, Bio::PAML::Codeml::Report::Model) or Struct. > It seems good. In fact, I have made it an array. See my PAML branch. > > runner.rb --test-dependencies > > I'm now searching ways to pass such parameters to tests. In the runner you can parse the parameters first and pull them off the stack. I did something like that for cfruby: http://cfruby.rubyforge.org/git?p=cfruby.git;a=blob;f=test/runner.rb;h=c202e48783a744c4cb3e339e2b891b3eab354c3e;hb=HEAD > I'd like to keep tests simple and clear, and I think using standard > File.read is enough and clearer. When using such special class, to know > the behavior of the test code, reading extra file is needed. I disagree, but that is obvious. > > Personally, I dislike the naming/name space scheme of Bioruby. 
> > What to think of invoking a class named > > > > report = Bio::PAML::Codeml::Report.new > > Because there are many bioinformatics software and databases, names > tends to be longer, and nesting of namespace tends to be deeper. > I'd like to know naming rules and policies of other open-bio projects. I think we should not mirror ourselves on these. We can do better. RoR is a much better example to mirror ourselves on. > > Why can't it just be > > > > include Bio > > report = Codeml.new > > I think it is enough to write "include Bio::PAML" instead of (or in > addition to) "include Bio". Not really. It brings in another source of errors for users if they have to think about that context every time. We will get all variants, like Bio::Kegg, Bio::Sequence etc. I think name spaces are there to *avoid* conflict. If a naming scheme precludes conflict, why bring in another layer? I want Bioruby to be as easy as possible, and with the least amount of typing. More text = harder to read. > > include Bio > > result = Paml.new(:program => 'codeml') > > I don't like introducing such new parameter like :program. > I think 1 class 1 binary is better. I agree. It was just another option. > In addition, because the differences within PAML tools (codeml, baseml, > yn00, etc.) are currently not small, merging the classes is not so > realistic now. We have to separate our own conveniences from design choices. Meanwhile I do agree we should not change the current interfaces. We can create a new version of Bioruby with both old and new interfaces supported. That is one thing I propose. I am putting together a discussion document on the future of Bioruby (design choices). We will have opportunity to discuss that in Japan. We can consider raising a community vote once we have a list of options. Pj. 
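The namespace question argued above comes down to how Ruby resolves constants after `include`. As a minimal sketch, assuming stand-in modules that only mimic the `Bio::PAML::Codeml::Report` nesting (nothing below loads BioRuby, and the `buf` reader is a hypothetical placeholder), the three spellings compare like this:

```ruby
# Hypothetical stand-ins for the BioRuby namespaces discussed in the thread.
# Only the nesting is modelled; this is not the real library code.
module Bio
  module PAML
    class Codeml
      class Report
        attr_reader :buf

        def initialize(buf)
          @buf = buf
        end
      end
    end
  end
end

# Fully qualified name: works without any include.
full = Bio::PAML::Codeml::Report.new("codeml output")

# `include Bio` exposes Bio's direct constants (PAML), but not the
# constants nested one level further down.
include Bio
via_bio = PAML::Codeml::Report.new("codeml output")

# `include Bio::PAML` additionally exposes Codeml itself.
include Bio::PAML
short = Codeml::Report.new("codeml output")
```

So `include Bio` alone does not make `Codeml.new` work; that spelling would need either the second include or a flatter layout of the library, which is the design choice under discussion.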
From pjotr.public14 at thebird.nl Mon Jan 4 11:51:05 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 4 Jan 2010 12:51:05 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100104115105.GA21035@thebird.nl> I have updated the writeup at http://bioruby.open-bio.org/wiki/BIORUBY_PAML Have a look at my PAML branch. The (old) unit tests pass. http://github.com/pjotrp/bioruby/tree/PAML I have to add the positive selection sites to complete it. Pj. From tomoakin at kenroku.kanazawa-u.ac.jp Mon Jan 4 12:33:20 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Mon, 4 Jan 2010 21:33:20 +0900 Subject: [BioRuby] Bioruby design In-Reply-To: <20100104090318.GA16136@thebird.nl> References: <20100104090318.GA16136@thebird.nl> Message-ID: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp> Hi, > As people tend not to think of Paml as a toolbox I would prefer > to have one object named Paml. With behind it the codeml 'engine' > and reporter. This would work for me (also note Paml does > not return a report, but rather a result): I don't agree on this point. PHYLIP is clearly a package or collection of programs, and so are Molphy, PAML, ... > result = Paml.new(:program => 'codeml') And if you make a single object, it is not too obvious how to divide it based on the program, since aaml is now done by codeml but should be considered a clearly different function. >>> include Bio >>> report = Codeml.new >>> >> >> I think it is enough to write "include Bio::PAML" instead of (or in >> addition to) "include Bio". >> > > Not really. It brings in another source of errors for users if they > have to think about that context every time. We will get all > variants, like Bio::Kegg, Bio::Sequence etc.
These are short enough, since we have to write something like "PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"... in the manuscript of the paper if we use that module. Stating their use explicitly in the first lines of the program is considered good. On the other hand, I don't like "include Bio::Sequence", since it is a function of bioruby in itself. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From pjotr.public14 at thebird.nl Mon Jan 4 15:04:59 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 4 Jan 2010 16:04:59 +0100 Subject: [BioRuby] Bioruby design In-Reply-To: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp> References: <20100104090318.GA16136@thebird.nl> <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp> Message-ID: <20100104150459.GB21412@thebird.nl> On Mon, Jan 04, 2010 at 09:33:20PM +0900, Tomoaki NISHIYAMA wrote: > These are short enough, since we have to write something like > "PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"... > in the manuscript of the paper if we use that module. > Stating their use explicitly in the first lines of the > program is considered good. Uhm. I think that is a bit far-fetched. The way you propose it is that you would have to load the name space every time you use something in code: require 'bio' include Bio::PAML include Bio::Kegg include ... do something Next source file, the same. And again: require 'bio' include Bio::PAML include Bio::Kegg include ... do something This is the philosophy of Python - where every source file explicitly loads all modules/name spaces. It is arguably 'clear'. But ugly. And, takes the fun out of programming (anyone mention that?). Only once I have used the Python name spacing with good effect. It was when we plugged in a replacement module - completely rewritten. That was changing one line only - and it worked :-).
In Python you can say import Paml as paml, and it became import Paml2 as paml. That was nice. But when you see Python source files, the header is ugly, and wastes a lot of typing. See for example: http://pypi.python.org/pypi/zope.sqlalchemy#example I argue not to state imports. import Bio should be part of require 'bio' Anyway, we will have time to talk in Tokyo, I hope. Pj. P.S. Do you have an example of anyone quoting a Bioruby module in a paper? From pjotr.public14 at thebird.nl Mon Jan 4 17:09:04 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 4 Jan 2010 18:09:04 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <20100104115105.GA21035@thebird.nl> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> Message-ID: <20100104170904.GA26187@thebird.nl> The writeup is pretty much done, as well as the implementation. http://bioruby.open-bio.org/wiki/BIORUBY_PAML All unit tests pass: Running tests for PAML Loaded suite . Started .................... Finished in 0.398394 seconds. 20 tests, 37 assertions, 0 failures, 0 errors It is compatible with the old version. I have added 41 assertions in the doctest (the header of report.rb). === Testing 'mydoc.test'... 1. OK | Default Test 41 comparisons, 1 doctests, 0 failures, 0 errors You can view the tests and implementation at http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb See also The branch is: http://github.com/pjotrp/bioruby/tree/PAML (don't you love github). Pj.
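The compatibility claim above (a new `models` array alongside the old single-model interface) could be arranged roughly as follows. This is a sketch under assumptions, not the code in the PAML branch: the `Sketch` module, the `Model` struct, and the numeric values are all invented for illustration, and the real parser builds its models from codeml output rather than taking them as an argument.

```ruby
# Hypothetical sketch of a multi-model report that keeps old accessors alive.
# Names and values are illustrative; see the PAML branch for the real code.
module Sketch
  # One fitted model (for example "M0"), held as a simple value object.
  Model = Struct.new(:name, :lnL, :alpha)

  class Report
    # Array of Model objects, in the order they appear in the report.
    attr_reader :models

    def initialize(models)
      @models = models
    end

    # Old-style accessor: answers for the first model only, so code written
    # against the single-model interface keeps working.
    def lnL
      models.first.lnL
    end

    # Old-style accessor, delegating the same way.
    def alpha
      models.first.alpha
    end
  end
end

# Made-up numbers, only to exercise the interface.
report = Sketch::Report.new([
  Sketch::Model.new("M0", -100.0, 0.5),
  Sketch::Model.new("M3", -90.0, 0.5)
])
```

With this arrangement `report.models[0].name` gives `"M0"` as in the proposed usage, while `report.lnL` still returns a single value for older callers.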
From mail at michaelbarton.me.uk Mon Jan 4 17:50:50 2010 From: mail at michaelbarton.me.uk (Michael Barton) Date: Mon, 4 Jan 2010 12:50:50 -0500 Subject: [BioRuby] Codeml parser In-Reply-To: <20100104170904.GA26187@thebird.nl> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> Message-ID: Hi Pjotr, The expanded report.rb looks like an excellent and substantial improvement to the previous version. You could add a deprecated tag to the old interface methods and these could then be removed in a later bioruby version to decrease clutter in the API. Mike 2010/1/4 Pjotr Prins : > The writeup is pretty much done, as well as the implementation. > > http://bioruby.open-bio.org/wiki/BIORUBY_PAML > > All unit tests pass: > > Running tests for PAML > Loaded suite . > Started > .................... > Finished in 0.398394 seconds. > 20 tests, 37 assertions, 0 failures, 0 errors > > It is compatible with the old version. I have added 41 assertions > in the doctest (the header of report.rb). > > === Testing 'mydoc.test'... > 1. OK | Default Test > 41 comparisons, 1 doctests, 0 failures, 0 errors > > You can view the tests and implementation at > > http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb > See also > > The branch is: > > http://github.com/pjotrp/bioruby/tree/PAML > > (don't you love github). > > Pj.
> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 07:42:49 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 5 Jan 2010 16:42:49 +0900 Subject: [BioRuby] Codeml parser In-Reply-To: <20100104170904.GA26187@thebird.nl> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> Message-ID: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, I'm reading the code (commit c2de9dd3ad055bab4bfb1d3e8da840493b110b0e). It is generally good. Below are my comments and suggested changes. > # == Examples > # > # Read the codeml M0-M3 data file into a buffer > # > # >> require 'bio/test/biotestfile' > # >> buf = BioTestFile.read('paml/codeml/models/results0-3.txt') It is not suitable to use such nonstandard class in the example. Users want to know the example usage and do not intend to test. Note that I still disagree with the BioTestFile class. > class Report < Bio::PAML::Common::Report > > attr_reader :models, :header, :footer RDoc documentation is also needed for attributes. To write RDoc, the three attribute definitions are needed to be separated. For example, # Models in the result # (Array containing Bio::PAML::Codeml::Model objects) attr_reader :models # ...(should be written) attr_reader :header # ...(should be written) attr_reader :footer > # Parse codeml output file passed with +buf+ > def initialize buf Details of +buf+ (class, contents, etc) should also be written in RDoc. It is recommended to use the style written in the README_DEV.rdoc, or the style used in the Ruby source code. Please do not omit parentheses in the method definition lines. > # Model class > class Model Too few documentation. 
At least please write a message that it is created by Bio::PAML::Codeml::Report. > def initialize buf Please write in the RDoc that normal users do not use the method directly, and that it is called internally inside the Bio::PAML::Codeml::Report objects. Please do not omit parentheses in the method definition lines. > def lnL RDoc documentation is needed here, and also for the omega, kappa, alpha, tree_length, tree, and to_s methods. > class PositiveSite Almost all methods have no RDoc documentation. > def to_a > [ @position, @aaref, @probability, @omega ] > end What is the purpose of the method? > class PositiveSites < Array To inherit Array and to create original container class is discouraged. In BioRuby, we have deprecated Bio::Features and Bio::References in version 1.3.0, although they do not inherit Array but have an array in the object. (The classes still exist only for backward compatibility, in lib/bio/compat/features.rb and references.rb). In this case, except initialize, only a method named "graph" is added. I think it is good to add the graph method in the Report class and using an Array for storing PositiveSite objects. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Tue Jan 5 10:32:12 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 11:32:12 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100105103212.GA4584@thebird.nl> Hi Naohisa, First I thought you were kidding. But then I realise you are serious. I don't think we need to document every simple class variable/accessor to accept this source code. That is overkill. If you don't understand lnL or alpha, don't use it.
We are not in the business of documenting for documenting's sake. Documenting lnL and alpha will be like: "Retrieve the lnL value from the Report" "Retrieve the alpha value from the Report" etc. etc. I don't think we should be doing that. Standard 1-to-1 relations are obvious and don't need lots of text in the code base. If someone feels like filling in these obvious statements, fine. It really goes against my grain. Do we document every single accessor? Note the previous implementation did no such thing. That code was accepted fine (and partially written by you). > Details of +buf+ (class, contents, etc) should also be written in RDoc. > It is recommended to use the style written in the README_DEV.rdoc, or > the style used in the Ruby source code. You mean the contents of the input buffer, which is the content of the input file? I see many places in Bioruby where no such a thing is done. Why become strict on this now? If you want a different descriptive name for the variable - that is fine. Propose me a better name. > > def to_a > > [ @position, @aaref, @probability, @omega ] > > end > What is the purpose of the method? Access converter. Convenience, really. You can remove it if you dislike it so much. I use it for testing and to write to a file. Could be to_s too, but that fixates the format. > > class PositiveSites < Array > > To inherit Array and to create original container class is discouraged. > In BioRuby, we have deprecated Bio::Features and Bio::References in > version 1.3.0, although they do not inherit Array but have an array > in the object. (The classes still exist only for backward compatibility, > in lib/bio/compat/features.rb and references.rb). PositiveSites object has the all the features of a list (ie Array). I think inheritance is what it should be. It is an is_a relationship. Adding a @list will just add code. Not only for initialization, but also for iterators. I only see how we can move backwards from readable code.
Nor is it good OOP practice. Inheritance is not *always* bad, though I agree it is used too quickly (in general). > In this case, except initialize, only a method named "graph" is added. > I think it is good to add the graph method in the Report class and > using an Array for storing PositiveSite objects. This is awful. The graph is a feature of PositiveSites, and not of the report *parser*. To keep things simple it is best practise to have functionality where it belongs. It is good OOP design. Your proposal means the Report class becomes less obvious in what it is. Look how clean it is now! What do other people think on this list. I am at a disadvantage here. I would like this code accepted in Bioruby, so other people can use it. I disagree with most of above 'criticism'. I certainly balk at the last non-OOP ones. This is not the first time I am really unhappy. I can't believe how much trouble I have to go to for a simple class, which, as it happens, has a perfectly acceptable implementation by most measures. Pj. From jan.aerts at gmail.com Tue Jan 5 11:53:53 2010 From: jan.aerts at gmail.com (Jan Aerts) Date: Tue, 5 Jan 2010 11:53:53 +0000 Subject: [BioRuby] Codeml parser In-Reply-To: <20100105103212.GA4584@thebird.nl> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> Message-ID: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> All, It appears that the pre-hackathon bioruby meeting will be very useful :-) Why don't we use that time to focus on the bit-more-distant future of bioruby: bioruby 2.0? We could discuss what it should look like without having to worry about backward compatibility. 
Topics: * documentation style (I happen to agree with Naohisa on that) * class hierarchy: how would we organize the information if we had to start from scratch? (maybe we should follow bioperl's lead with a Root class?) * coding style * general interface decisions * ... jan. PS: Still don't know if I can make it to Japan. Will know this afternoon (broken foot might interfere...) 2010/1/5 Pjotr Prins > Hi Naohisa, > > First I thought you were kidding. But then I realise you are serious. > > I don't think we need to document every simple class variable/accessor > to accept this source code. That is overkill. If you don't understand > lnL or alpha, don't use it. We are not in the business of documenting > for documenting's sake. Documenting lnL and alpha will be like: > > "Retrieve the lnL value from the Report" > > "Retrieve the alpha value from the Report" > > etc. etc. I don't think we should be doing that. Standard 1-to-1 > relations are obvious and don't need lots of text in the code base. > > If someone feels like filling in these obvious statements, fine. It > really goes against my grain. Do we document every single accessor? > Note the previous implementation did no such thing. That code was > accepted fine (and partially written by you). > > > Details of +buf+ (class, contents, etc) should also be written in RDoc. > > It is recommended to use the style written in the README_DEV.rdoc, or > > the style used in the Ruby source code. > > You mean the contents of the input buffer, which is the content of the > input file? I see many places in Bioruby where no such a thing is > done. Why become strict on this now? If you want a different > descriptive name for the variable - that is fine. Propose me > a better name. > > > > def to_a > > > [ @position, @aaref, @probability, @omega ] > > > end > > What is the purpose of the method? > > Access converter. Convenience, really. You can remove it if you > dislike it so much. I use it for testing and to write to a file.
Could > be to_s too, but that fixates the format. > > > > class PositiveSites < Array > > > > To inherit Array and to create original container class is discouraged. > > In BioRuby, we have deprecated Bio::Features and Bio::References in > > version 1.3.0, although they do not inherit Array but have an array > > in the object. (The classes still exist only for backward compatibility, > > in lib/bio/compat/features.rb and references.rb). > > PositiveSites object has the all the features of a list (ie Array). I > think inheritance is what it should be. It is an is_a relationship. > Adding a @list will just add code. Not only for initialization, but > also for iterators. I only see how we can move backwards from readable > code. Nor is it good OOP practice. Inheritance is not *always* bad, > though I agree it is used too quickly (in general). > > > In this case, except initialize, only a method named "graph" is added. > > I think it is good to add the graph method in the Report class and > > using an Array for storing PositiveSite objects. > > This is awful. The graph is a feature of PositiveSites, and not of the > report *parser*. To keep things simple it is best practise to have > functionality where it belongs. It is good OOP design. Your proposal > means the Report class becomes less obvious in what it is. Look how > clean it is now! > > What do other people think on this list. I am at a disadvantage here. > > I would like this code accepted in Bioruby, so other people can use > it. I disagree with most of above 'criticism'. I certainly balk at the > last non-OOP ones. This is not the first time I am really unhappy. I > can't believe how much trouble I have to go to for a simple class, > which, as it happens, has a perfectly acceptable implementation by > most measures. > > Pj. 
> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From pjotr.public14 at thebird.nl Tue Jan 5 12:39:02 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 13:39:02 +0100 Subject: [BioRuby] Clustal ALN writer Message-ID: <20100105123902.GA10823@thebird.nl> I propose to write an ALN output writer. ALN files show aligned sequences with additional lines of information (like a match line). I want to use it to output PAML positive selection sites. This is the idea: SEQ1 alignment 1... SEQ2 alignment 2... ...*.:*....*** (match line) ...*....*..... (pos. sel. line) Do we want such ALN output (I think it is allowed), and can we allow for the additional output. I have a proposed interface here: http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb I notice ClustalW.to_fasta has been obsoleted. But we don't have to_aln yet, and we need to allow adding match_lines and other information. Pj. From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 13:20:24 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 5 Jan 2010 22:20:24 +0900 Subject: [BioRuby] Codeml parser In-Reply-To: <20100105103212.GA4584@thebird.nl> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> Message-ID: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, On Tue, 5 Jan 2010 11:32:12 +0100 Pjotr Prins wrote: > Hi Naohisa, > > First I thought you were kidding. But then I realise you are serious. > > I don't think we need to document every simple class variable/accessor > to accept this source code. That is overkill. 
If you don't understand > lnL or alpha, don't use it. We are not in the business of documenting > for documenting's sake. Documenting lnL and alpha will be like: > > "Retrieve the lnL value from the Report" > > "Retrieve the alpha value from the Report" > > etc. etc. I don't think we should be doing that. Standard 1-to-1 > relations are obvious and don't need lots of text in the code base. Even just one word is OK, e.g. "lnL", "alpha". But no RDoc is not allowed. Ideally, it may be really great if well informative description can help people unfamiliar with Codeml, and this may encourage people beginning to use Codeml with BioRuby. I understand this can not be easily achieved. When writing a new class or largely adding codes, it is also good to implement first with least documentation and later to improve documents gradually. > If someone feels like filling in these obvious statements, fine. It > really goes against my grain. Do we document every single accessor? > Note the previous implementation did no such thing. That code was > accepted fine (and partially written by you). In late 2005, we determined that all methods, attributes, classes, modules, etc. should be documented by using RDoc. Codes written before earlier 2006 may have no RDoc. I'm working to add RDoc in such codes gradually, but not finished yet. > > Details of +buf+ (class, contents, etc) should also be written in RDoc. > > It is recommended to use the style written in the README_DEV.rdoc, or > > the style used in the Ruby source code. > > You mean the contents of the input buffer, which is the content of the > input file? I see many places in Bioruby where no such a thing is > done. Why become strict on this now? If you want a different > descriptive name for the variable - that is fine. Propose me > a better name. No need to change the variable name. I mean I want to clarify that it points contents of the file and not filename. If you think current description is enough apparent, it is OK. 
> > > def to_a > > > [ @position, @aaref, @probability, @omega ] > > > end > > What is the purpose of the method? > > Access converter. Convenience, really. You can remove it if you > dislike it so much. I use it for testing and to write to a file. Could > be to_s too, but that fixates the format. OK if you feel useful. > > > class PositiveSites < Array > > > > To inherit Array and to create original container class is discouraged. > > In BioRuby, we have deprecated Bio::Features and Bio::References in > > version 1.3.0, although they do not inherit Array but have an array > > in the object. (The classes still exist only for backward compatibility, > > in lib/bio/compat/features.rb and references.rb). > > PositiveSites object has the all the features of a list (ie Array). I > think inheritance is what it should be. It is an is_a relationship. > Adding a @list will just add code. Not only for initialization, but > also for iterators. I only see how we can move backwards from readable > code. Nor is it good OOP practice. Inheritance is not *always* bad, > though I agree it is used too quickly (in general). > > > In this case, except initialize, only a method named "graph" is added. > > I think it is good to add the graph method in the Report class and > > using an Array for storing PositiveSite objects. > > This is awful. The graph is a feature of PositiveSites, and not of the > report *parser*. To keep things simple it is best practise to have > functionality where it belongs. It is good OOP design. Your proposal > means the Report class becomes less obvious in what it is. Look how > clean it is now! I respect your design if the class is not only a container of PositiveSite objects but also having methods doing special things by using relations among two or more objects which is not a simple accumulation of each object's information. > What do other people think on this list. I am at a disadvantage here. 
> > I would like this code accepted in Bioruby, so other people can use > it. I disagree with most of above 'criticism'. I certainly balk at the > last non-OOP ones. This is not the first time I am really unhappy. I > can't believe how much trouble I have to go to for a simple class, > which, as it happens, has a perfectly acceptable implementation by > most measures. > > Pj. > Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From ngoto at gen-info.osaka-u.ac.jp Tue Jan 5 13:28:28 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 5 Jan 2010 22:28:28 +0900 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105123902.GA10823@thebird.nl> References: <20100105123902.GA10823@thebird.nl> Message-ID: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, There is already Bio::Alignment#output_clustal method. It is implemented in Bio::Alignment::Output module. http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 5 Jan 2010 13:39:02 +0100 Pjotr Prins wrote: > I propose to write an ALN output writer. ALN files show aligned > sequences with additional lines of information (like a match line). I > want to use it to output PAML positive selection sites. This is > the idea: > > > SEQ1 alignment 1... > SEQ2 alignment 2... > ...*.:*....*** (match line) > ...*....*..... (pos. sel. line) > > Do we want such ALN output (I think it is allowed), and can we allow > for the additional output. I have a proposed interface here: > > http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb > > I notice ClustalW.to_fasta has been obsoleted. But we don't have > to_aln yet, and we need to allow adding match_lines and other > information. > > Pj. 
> From pjotr.public14 at thebird.nl Tue Jan 5 17:04:34 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 18:04:34 +0100 Subject: [BioRuby] Codeml parser In-Reply-To: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100105170434.GB13498@thebird.nl> Hi Naohisa, Thanks for clarifying. I am happy now. Pj. From pjotr.public14 at thebird.nl Tue Jan 5 17:09:25 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 5 Jan 2010 18:09:25 +0100 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100105170925.GA13828@thebird.nl> On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote: > Hi Pjotr, > > There is already Bio::Alignment#output_clustal method. > It is implemented in Bio::Alignment::Output module. > > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 I missed that. Still it has no functionality for adding the match_line, nor for adding extra information lines. Can I modify this to give this method an optional parameter (list of String) for this? The Alignment class is not aware of 'imported' match lines (it is Clustal specific in Bioruby at this stage). How do you suppose we can do this so I can generate the ALN with multiple match lines? Pj. 
From ngoto at gen-info.osaka-u.ac.jp Wed Jan 6 03:31:25 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 6 Jan 2010 12:31:25 +0900 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100105170925.GA13828@thebird.nl> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105170925.GA13828@thebird.nl> Message-ID: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 5 Jan 2010 18:09:25 +0100 Pjotr Prins wrote: > On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > There is already Bio::Alignment#output_clustal method. > > It is implemented in Bio::Alignment::Output module. > > > > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092 > > I missed that. Still it has no functionality for adding the > match_line, nor for adding extra information lines. Can I modify this > to give this method an optional parameter (list of String) for this? > > The Alignment class is not aware of 'imported' match lines (it is Clustal > specific in Bioruby at this stage). The output_clustal method gets an argument named "options" as a Hash. The match line can be altered by any given string with an option. alignment.output_clustal(:match_line => str) I'm very sorry for incomplete documentation. It was first written in 2003, and documents were added after 2005 but still incomplete. Bio::Alignment#match_line method is the match line calculation method with the same algorithm as ClustalW. > How do you suppose we can do this so I can generate the ALN with > multiple match lines? I'm afraid this is not regarded as Clustal format. Of course, it is technically easy to add such function. There may be many private extensions of Clustal format. I think this is OK because Clustal format is rough, although this makes hard to validate Clustal format. 
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Wed Jan 6 08:07:10 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 6 Jan 2010 09:07:10 +0100 Subject: [BioRuby] Clustal ALN writer In-Reply-To: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> References: <20100105123902.GA10823@thebird.nl> <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105170925.GA13828@thebird.nl> <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20100106080710.GA23141@thebird.nl> On Wed, Jan 06, 2010 at 12:31:25PM +0900, Naohisa GOTO wrote: > > How do you suppose we can do this so I can generate the ALN with > > multiple match lines? > > I'm afraid this is not regarded as Clustal format. > Of course, it is technically easy to add such function. > > There may be many private extensions of Clustal format. > I think this is OK because Clustal format is rough, > although this makes hard to validate Clustal format. Standards are vague. EMBOSS does not even mention the match line, but as ClustalW generates it we assume it is a 'standard'. I think most parsers basically ignore lines starting with white space. So multiple 'match lines' should normally work. Many standards in bioinformatics evolve from use - maybe my idea will become a standard one day ;-). I think it is a nice feature to have. I'll add a warning that one should use it with caution. BTW the ALN-writer should really live in its own class/module, similar to the current layout for the 'Report' class (which in reality is an ALN parser, or ALN-reader). It is no surprise I did not find either of them when I was looking for an implementation. OK, I'll cook something up in a separate git branch. Pj. 
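
The remark above that most parsers ignore lines starting with white space can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: parse_aln is a hypothetical helper, not BioRuby's Bio::ClustalW::Report, and the second "match line" in the sample input is made up.

```ruby
# Toy ALN reader (hypothetical, not the BioRuby parser).
# Header, blank lines, and any line that begins with whitespace
# (match lines or extra annotation lines) are skipped; every other
# line is read as "name residues".
def parse_aln(text)
  seqs = Hash.new { |hash, name| hash[name] = String.new }
  text.each_line do |line|
    next if line.start_with?('CLUSTAL')  # format header
    next if line =~ /\A\s/               # blank, match, or annotation line
    name, residues = line.split
    seqs[name] << residues if residues
  end
  seqs
end

aln = <<~ALN
  CLUSTAL W (1.83) multiple sequence alignment

  seq1      ATCCATGG
  seq2      ATGCATGC
            ** ****
            ..x.....
ALN

parse_aln(aln).each { |name, residues| puts "#{name} #{residues}" }
# prints:
# seq1 ATCCATGG
# seq2 ATGCATGC
```

A reader written this way does not care how many whitespace-led lines follow each block, which is why several match lines would normally pass through unnoticed. A stricter parser that validates the match line would behave differently.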
From mail at michaelbarton.me.uk Wed Jan 6 16:58:01 2010 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 6 Jan 2010 11:58:01 -0500 Subject: [BioRuby] Codeml parser In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> Message-ID: 2010/1/5 Jan Aerts : > It appears that the pre-hackathon bioruby meeting will be very useful :-) > Why don't we use that time to focus on the bit-more-distant future of > bioruby: bioruby 2.0? We could discuss what it should look like without > having to worry about backward compatibility. I second what Jan has suggested about the direction of BioRuby and version 2.0. As Ruby becomes more popular a programming language in bioinformatics it might be expected that BioRuby will receive more and more contributions. Prior to BioRuby 2.0 might be a nice time to discuss how BioRuby will grow and be organised as it increases in size. Topics: > * documentation style (I happen to agree with Naohisa on that) > * class hierarchy: how would we organize the information if we had to start > from scratch? (maybe we should follow bioperl's lead with a Root class?) > * coding style > * general interface decisions > * ... > > jan. > > PS: Still don't know if I can make it to Japan. Will know this afternoon > (broken foot might interfere...) > > 2010/1/5 Pjotr Prins > >> Hi Naohisa, >> >> First I thought you were kidding. But then I realise you are serious. >> >> I don't think we need to document every simple class variable/accessor >> to accept this source code. That is overkill. If you don't understand >> lnL or alpha, don't use it. 
We are not in the business of documenting >> for documenting's sake. Documenting lnL and alpha will be like: >> >> "Retrieve the lnL value from the Report" >> >> "Retrieve the alpha value from the Report" >> >> etc. etc. I don't think we should be doing that. Standard 1-to-1 >> relations are obvious and don't need lots of text in the code base. >> >> If someone feels like filling in these obvious statements, fine. It >> really goes against my grain. Do we document every single accessor? >> Note the previous implementation did no such thing. That code was >> accepted fine (and partially written by you). >> >> > Details of +buf+ (class, contents, etc) should also be written in RDoc. >> > It is recommended to use the style written in the README_DEV.rdoc, or >> > the style used in the Ruby source code. >> >> You mean the contents of the input buffer, which is the content of the >> input file? I see many places in Bioruby where no such a thing is >> done. Why become strict on this now? If you want a different >> descriptive name for the variable - that is fine. Propose me >> a better name. >> >> > > def to_a >> > > [ @position, @aaref, @probability, @omega ] >> > > end >> > What is the purpose of the method? >> >> Access converter. Convenience, really. You can remove it if you >> dislike it so much. I use it for testing and to write to a file. Could >> be to_s too, but that fixates the format. >> >> > > class PositiveSites < Array >> > >> > To inherit Array and to create original container class is discouraged. >> > In BioRuby, we have deprecated Bio::Features and Bio::References in >> > version 1.3.0, although they do not inherit Array but have an array >> > in the object. (The classes still exist only for backward compatibility, >> > in lib/bio/compat/features.rb and references.rb). >> >> PositiveSites object has the all the features of a list (ie Array). I >> think inheritance is what it should be. It is an is_a relationship.
>> Adding a @list will just add code. Not only for initialization, but >> also for iterators. I only see how we can move backwards from readable >> code. Nor is it good OOP practice. Inheritance is not *always* bad, >> though I agree it is used too quickly (in general). >> >> > In this case, except initialize, only a method named "graph" is added. >> > I think it is good to add the graph method in the Report class and >> > using an Array for storing PositiveSite objects. >> >> This is awful. The graph is a feature of PositiveSites, and not of the >> report *parser*. To keep things simple it is best practise to have >> functionality where it belongs. It is good OOP design. Your proposal >> means the Report class becomes less obvious in what it is. Look how >> clean it is now! >> >> What do other people think on this list. I am at a disadvantage here. >> >> I would like this code accepted in Bioruby, so other people can use >> it. I disagree with most of above 'criticism'. I certainly balk at the >> last non-OOP ones. This is not the first time I am really unhappy. I >> can't believe how much trouble I have to go to for a simple class, >> which, as it happens, has a perfectly acceptable implementation by >> most measures. >> >> Pj. 
>>
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>

From jan.aerts at gmail.com Fri Jan 8 16:29:07 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 8 Jan 2010 16:29:07 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
Message-ID: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>

Maybe it'd be a good idea to start thinking at a level removed from actual
code, and create some general design documents first. Maybe we should
* describe what we actually want to achieve with the bioruby toolkit: should
  it be a library foremost, or should it rather be an interface to run other
  programs (e.g. BLAST)?
* make a high-level overview of different parts of bioruby:
  - how do we handle file formats: are the files actual objects, or do they
    merely describe a biological entity? E.g. does a FASTA file merit the
    instantiation of a FASTA object, or is it nothing more than a container of
    Sequence objects?
  - how do different parts of the library interact? Should we have a Root
    class such as in bioperl? What type of class should be used to interface
    with the world (e.g. file parsing)? What type of class should be used to
    actually contain the object data (e.g. annotated sequence)?

When that's done: come up with general guidelines for coding, e.g. always
use keyword-based argument lists or something (just an example).

jan.

From pjotr.public14 at thebird.nl Fri Jan 8 17:21:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 8 Jan 2010 18:21:32 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl> <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> <20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl> <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> <20100105103212.GA4584@thebird.nl> <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com> <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
Message-ID: <20100108172132.GA28895@thebird.nl>

On Fri, Jan 08, 2010 at 04:29:07PM +0000, Jan Aerts wrote:
> Maybe it'd be a good idea to start thinking at a level removed from actual
> code, and create some general design documents first. Maybe we should
> * describe what we actually want to achieve with the bioruby toolkit: should
> it be a library foremost, or should it rather be an interface to run other
> programs (e.g. BLAST)?

I think calling into other programs is a good feature, but should be
really split out. Likewise for web services. Both split in terms of
objects and directory layout. Currently there is too intertwined
functionality.
Then there is support for reading and writing standard formats. Then there is extra functionality (not found elsewhere, perhaps). And we have Rails support and the shell. All these should be clearly split out. I don't think we have to choose. We can have it all. Just make sure it sits in the right location. > * make a high-level overview of different parts of bioruby: > - how do we handle file formats: are the files actual objects, or do they > merely describe a biological entity? E.g. does a FASTA file merit the > instantiation of a FASTA object, or is it nothing more than a container of > Sequence objects? > - how do different parts of the library interact? Should we have a Root > class such as in bioperl? What type of class should be used to interface > with the world (e.g. file parsing)? What type of class should be used to > actually contain the object data (e.g. annotated sequence)? > > When that's done: come up with general guidelines for coding, e.g. always > use keyword-based argument lists or something (just an example). These choices are design choices and have to originate in a list of shared 'values'. Because if we don't agree on a value there will always be arguments and disagreement. One value would be 'clear documentation', but this may collide with 'clear source code'. Similarly 'Easy to use code' and 'Concise code' may collide. Or functional choices over OOP. We need to put those values together and rank them in importance. Once the ranking is set we can make easy choices in guidelines. I am writing a type of Manifest. I'll present that in the coming weeks, when I feel I am ready. It is meant for discussion in Japan, and after. Pj. 
From pjotr.public14 at thebird.nl Mon Jan 11 14:40:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 11 Jan 2010 15:40:41 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100111144041.GA31684@thebird.nl>

I have created a colorized HTML alignment file with consensus
information and amino acids showing evidence of positive selection
(based on PAML output).

  http://thebird.nl/projects/test_color2.html

I did a write up on the implementation at:

  http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR

Enjoy,

Pj.

From ngoto at gen-info.osaka-u.ac.jp Tue Jan 12 09:29:57 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 12 Jan 2010 18:29:57 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100111144041.GA31684@thebird.nl>
References: <20100111144041.GA31684@thebird.nl>
Message-ID: <20100112092957.A16001CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I'm not sure whether the prefix Bio::Html is suitable or not.

By the way, I've tried some of your code in
http://github.com/pjotrp/bioruby/blob/color-alignment/
and found a potential XSS.

  a = Bio::Alignment.new
  a.add_seq('ATCCATGG', '')
  a.add_seq('ATGCATGC', '')
  a.add_seq('', 'c')
  simple = Bio::Html::HtmlAlignment.new(a,
             :title => '')
  html = simple.html()
  File.open('/tmp/xss.html', 'w') { |w| w.print html }

For sequences, sequence names, and consensus lines,
CGI.escapeHTML() will always be needed.

For the :title, if script users can set the title, it
should be escaped, but this prevents script programmers
from using HTML tags in the title.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
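
The escaping suggested above can be sketched with Ruby's standard library alone. This is a minimal illustration, not the Bio::Html::HtmlAlignment code: render_row is a hypothetical helper, and the hostile sequence name is invented for the example.

```ruby
require 'cgi/escape'  # provides CGI.escapeHTML

# Hypothetical helper (not BioRuby API): renders one alignment row as
# text for a <pre> block, escaping both the name and the residues.
def render_row(name, residues)
  "#{CGI.escapeHTML(name)} #{CGI.escapeHTML(residues)}"
end

puts render_row('<script>alert(1)</script>', 'ATCCATGG')
# prints: &lt;script&gt;alert(1)&lt;/script&gt; ATCCATGG
```

Because the escaping happens where the HTML is assembled, the caller can pass names exactly as they appear in the alignment, and a browser shows them as text instead of interpreting them.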

From pjotr.public14 at thebird.nl Tue Jan 12 10:11:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 12 Jan 2010 11:11:32 +0100
Subject: [BioRuby] Bioruby HTML output
Message-ID: <20100112101132.GC10308@thebird.nl>

On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> I'm not sure whether the prefix Bio::Html is suitable or not.

Me neither ;). This is something to discuss when we meet. See my
write up on partitioning based on functionality or standards.

> By the way, I've tried some of your code in
> http://github.com/pjotrp/bioruby/blob/color-alignment/
> and found a potential XSS.
>
>   a = Bio::Alignment.new
>   a.add_seq('ATCCATGG', '')
>   a.add_seq('ATGCATGC', '')
>   a.add_seq('', 'c')
>   simple = Bio::Html::HtmlAlignment.new(a,
>              :title => '')
>   html = simple.html()
>   File.open('/tmp/xss.html', 'w') { |w| w.print html }
>
> For sequences, sequence names, and consensus lines,
> CGI.escapeHTML() will always be needed.
>
> For the :title, if script users can set the title, it
> should be escaped, but this prevents script programmers
> from using HTML tags in the title.

Perhaps the HTML generator should escape its output. Though I
personally think we should only be worried about security concerns
when people *enter* new data on input forms. That is when exploits
show up. I can argue that HTML generation should not concern itself
with HOW the inputs are presented. One advantage of having a
programmer set the 'title' is that he *can* embed HTML. Perhaps
escaping HTML is the responsibility of the programmer providing the
data, and therefore of the logic that handles input.

We have had a similar discussion before. We have to decide to what
level *output* code should concern itself with *input* security. I
have a feeling that too many of the Bioruby classes try to do too much.
How do we stay away from cluttering the code? How do we decide that
callers should not use HTML and handle security concerns?

You write:

> a.add_seq('ATCCATGG', '')

If a programmer wants that - it is his concern in my opinion. If he is
concerned about exploits he should not allow it. The Alignment class
does not care either. It is none of its business.

BTW I fixed a number of PAML::Codeml bugs on this branch. So you
can ignore the existing PAML branch. Let's continue with the color
coding, assuming you can live with the PAML::Codeml implementation,
as it stands.

Pj.

From donttrustben at gmail.com Tue Jan 12 12:52:42 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 12 Jan 2010 22:52:42 +1000
Subject: [BioRuby] SPTR problem
Message-ID:

Hi,

While parsing all the yeast UniProt txt files I came across a problem with
the gn parser - it was returning an array when I expected a hash. Looking at
the code, the problem seems to be this when statement:

      when /Name=/,/ORFNames=/
        @data['GN'] = gn_uniprot_parser
      else
        @data['GN'] = gn_old_parser
      end

http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:

GN   OrderedLocusNames=YMR084W;

So the GN line had OrderedLocusNames= but not Name= or ORFNames=, so it
didn't use the new parser, unlike the other entries I came across. Should
all 4 possibilities be tested for in the when statement (Synonyms= being
the 4th)?

Also, while I'm here:
* why does the returned hash have different keys than are in the file? e.g.
  ORFNames becomes :orfs?
* I also found the parsing process for whole genomes quite slow (multiple
  hours for well annotated ones).
* is there any standard way to handle concatenated UniProt files? I wrote my
  own as it was simple.

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same
place.
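
The change proposed in the SPTR message above can be sketched in isolation. This is a minimal sketch, assuming the four key names listed there (Name=, Synonyms=, OrderedLocusNames=, ORFNames=) are what mark the newer GN format. gn_format is a hypothetical stand-in rather than the actual Bio::SPTR code, and the last two sample lines are made up.

```ruby
# Hypothetical dispatch helper (not the Bio::SPTR implementation):
# decides which GN parser a line should be handed to.
GN_KEYS = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

def gn_format(gn_line)
  case gn_line
  when GN_KEYS then :uniprot  # newer key=value style
  else              :old      # older free-text style
  end
end

puts gn_format('GN   OrderedLocusNames=YMR084W;')  # line from the report above
puts gn_format('GN   Name=CDC28; Synonyms=CDK1;')  # made-up example
puts gn_format('GN   CDC28 OR CDK1.')              # made-up old-style example
# prints:
# uniprot
# uniprot
# old
```

Matching on any of the four keys means an entry that carries only OrderedLocusNames= is no longer routed to the old parser.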
From ngoto at gen-info.osaka-u.ac.jp Wed Jan 13 02:58:00 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 13 Jan 2010 11:58:00 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100112101132.GC10308@thebird.nl> References: <20100112101132.GC10308@thebird.nl> Message-ID: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 12 Jan 2010 11:11:32 +0100 Pjotr Prins wrote: > On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote: > > I'm not sure whether the prefix Bio::Html is suitable or not. > > Me neither ;). This is something to discuss when we meet. See my > write up on partitioning based on functionality or standards. > > > By the way, I'v tried some of your code in > > http://github.com/pjotrp/bioruby/blob/color-alignment/ > > and found potential XSS. > > > > a = Bio::Alignment.new > > a.add_seq('ATCCATGG', '') > > a.add_seq('ATGCATGC', '') > > a.add_seq('', 'c') > > simple = Bio::Html::HtmlAlignment.new(a, > > :title => '') > > html = simple.html() > > File.open('/tmp/xss.html', 'w') { |w| w.print html } > > > > For sequences, sequence names, and consensus lines, > > using CGI.escapeHTML() will always be needed. > > > > For the :title, if script users can set the title, it > > should be escaped, but this prevents script programmers > > using html tags in the title. > > Perhaps the HTML generator should escape its output. Though I > personally think we should only be worried about security concerns > when people *enter* new data on input forms. That is when exploits > show up. I can argue that HTML generation should not concern itself > with HOW the inputs are presented. One advantage of having a > programmer set the 'title' is that he *can* embed HTML. Perhaps > escaping HTML is the responsibility of the programmer providing the > data. And therefore to the logic that handles input. Even apart from security, sequence names (and sequences) that contain html special characters may not be correctly displayed. 
For example, sequences with three parameters a, b, and c.

  % cat test.aln
  CLUSTAL 2.0.9 multiple sequence alignment


  1<a<3_b>5_c<7   FKNVFTVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKWFLQ
  a>3_511         FKNVMSVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKTFLQ
                  ****::*********************************** ***
  % irb -r bio
  irb> report = Bio::ClustalW::Report.new(File.read('test.aln'))
  irb> alignment = report.alignment
  irb> simple = Bio::Html::HtmlAlignment.new(alignment, :title => 'a,b,c')
  irb> File.open('abc.html', 'w') { |w| w.print simple.html() }

The sequence names were handled correctly by ClustalW 2.0.9,
but they are displayed in an unexpected way in the HTML output.

This problem cannot be solved by escaping the input data.
If the sequence name "1<a<3_b>5_c<7" is escaped to
"1&lt;a&lt;3_b&gt;5_c&lt;7" before calling the method,
the text indentation will be broken because the text length
no longer matches the HTML display width. To solve this,
the escaping has to be done by the output formatting method
when it builds the HTML.

> We have had a similar discussion before. We have to decide to what
> level *output* code should concern itself with *input* security. I
> have a feeling that too much of Bioruby classes try to do too much.
> How do we stay away from cluttering the code? How do we decide that
> callers should not use HTML and handle security concerns?

It is difficult to avoid HTML-like strings that we want
treated as normal unformatted text but that some programs
unexpectedly treat as HTML, as in the example above.

For security, I'd like to ask security experts.
Anyone on this list?

I think escaping should be done by the formatting layer and
should be turned on by default, because:
* Only the output formatting layer knows how the input data
  is processed.
* In many cases, the data comes from outside, and we cannot
  expect it to be safe enough.
* Different escaping rules are needed for different output types,
  e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
  Escaping by output methods seems natural, and helps to switch
  output formats without worrying about escaping issues specific
  to each output format.

> You write:
>
> > a.add_seq('ATCCATGG', '')
>
> If a programmer wants that - it is his concern in my opinion. If he is
> concerned about exploits he should not allow it. The Alignment class
> does not care either. It is none of its business.

The example is an extreme case. For security, please ask experts.
Apart from security, I want ">", "<", "&", etc. to be
displayed correctly. I think the methods that build the HTML
output should take care of this.

> BTW I fixed a number of PAML::Codeml bugs on this branch. So you
> can ignore the existing PAML branch. Let's continue with the color
> coding, assuming you can live with the PAML::Codeml implementation,
> as it stands.

When do you want the Bio::PAML::Codeml code to be merged into the
blessed bioruby repository?

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Wed Jan 13 06:57:11 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 15:57:11 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>

Hi,

Happy New Year!

> For security, I'd like to ask security experts.
> Anyone on this list?

Though I am not an expert: in a Japanese blog post,
http://takagi-hiromitsu.jp/diary/20051227.html
Hiromitsu Takagi explains, from a security point of view, why escaping
should be done by default at the point of output. It sounds reasonable
to me, though I do not know of an English-language reference.

In addition,

> * Different escaping rules are needed for different output types,
>   e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
>   Escaping by output methods seems natural, and helps to switch
>   output formats without concerning escaping issues specific
>   to each output format.

This is a good argument.
If a title containing HTML tags is necessary, a non-default API that
accepts HTML-marked text rather than plain text should be considered.
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From pjotr.public14 at thebird.nl Wed Jan 13 07:37:06 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Jan 2010 08:37:06 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100113073706.GA25611@thebird.nl>

Hi all,

OK, I'll adapt the output generator to escape symbols.
And I think you are right it belongs in the generator. There are three scenario's really: 1. Output that never contains symbols (sequence) 2. Output that can contain symbols, but should be escaped (descriptions, id's) 3. Output that can contain HTML In my case I have all three. I think with a sequence we can assume the content is a legal string. Escaping is overkill and (if needed) points to a bigger problem. I think we should not clutter the code with (1) - or degrade performance by default. Case (2) yes! case (3), like a title or some text to plug in, we should escape by default, but add a parameter :html_escape == false for the cases the user wants to plug in HTML. OK? Pj. From tomoakin at kenroku.kanazawa-u.ac.jp Wed Jan 13 09:44:01 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Wed, 13 Jan 2010 18:44:01 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100113073706.GA25611@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> Message-ID: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> Hi, > I think with a sequence we can assume the content is a legal string. > Escaping is overkill and (if needed) points to a bigger problem. I > think we should not clutter the code with (1) - or degrade performance > by default. If we are talking on Bio::Html::HtmlAlignment, it is better to escape even for sequence or matchlines to make the class more independent of the implementation of alignment class. Note that sim4 uses >>>...>>> in its matchline, and a future intron aware amino acid alignment processing program might use special characters to indicate introns. If the performance is really a problem and it is in Bio::Alignment::Output, and the constructor guarantees that there is no special characters, then the escape may be skipped. 
Escaping everything is the default simple program structure and removing that process is a kind of optimization with some programming effort to guarantee its validity without escaping. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan On 2010/01/13, at 16:37, Pjotr Prins wrote: > Hi all, > > OK, I'll adapt the output generator to escape symbols. And I think > you are right it belongs in the generator. There are three scenario's > really: > > 1. Output that never contains symbols (sequence) > 2. Output that can contain symbols, but should be escaped > (descriptions, id's) > 3. Output that can contain HTML > > In my case I have all three. > > I think with a sequence we can assume the content is a legal string. > Escaping is overkill and (if needed) points to a bigger problem. I > think we should not clutter the code with (1) - or degrade performance > by default. > > Case (2) yes! > > case (3), like a title or some text to plug in, we should escape by > default, but add a parameter :html_escape == false for the cases > the user > wants to plug in HTML. > > OK? > > Pj. > From pjotr.public14 at thebird.nl Fri Jan 15 14:00:59 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 15 Jan 2010 15:00:59 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> Message-ID: <20100115140059.GA24948@thebird.nl> On second thought, escaping is less obvious than I thought. I can escape all generated HTML, but that leaves no way to customize the output. 
Say I want to include an href in a sequence descriptor - which is a fairly typical requirement - that would be disabled. Likewise if someone wants to customize the title or footer - or even the information on the match_line. The problem here is that we are defining use - forcing the generated HTML into a straight jacket by adding business logic. Are we really telling our users not to use HTML in sequence descriptors, even if it is tied to one type of output? I don't like it. I am going to add a 'master' switch for escaping of HTML. The default will be with escaping. Pj. On Wed, Jan 13, 2010 at 06:44:01PM +0900, Tomoaki NISHIYAMA wrote: > Hi, > >> I think with a sequence we can assume the content is a legal string. >> Escaping is overkill and (if needed) points to a bigger problem. I >> think we should not clutter the code with (1) - or degrade performance >> by default. > > > If we are talking on Bio::Html::HtmlAlignment, > it is better to escape even for sequence or matchlines to make > the class more independent of the implementation of alignment class. > Note that sim4 uses >>>...>>> in its matchline, and a future > intron aware amino acid alignment processing program might use > special characters to indicate introns. > > If the performance is really a problem and > it is in Bio::Alignment::Output, and the constructor guarantees > that there is no special characters, then the escape may be skipped. > Escaping everything is the default simple program structure and > removing that process is a kind of optimization with some programming > effort > to guarantee its validity without escaping. > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > On 2010/01/13, at 16:37, Pjotr Prins wrote: > >> Hi all, >> >> OK, I'll adapt the output generator to escape symbols. And I think >> you are right it belongs in the generator. There are three scenario's >> really: >> >> 1. 
Output that never contains symbols (sequence) >> 2. Output that can contain symbols, but should be escaped >> (descriptions, id's) >> 3. Output that can contain HTML >> >> In my case I have all three. >> >> I think with a sequence we can assume the content is a legal string. >> Escaping is overkill and (if needed) points to a bigger problem. I >> think we should not clutter the code with (1) - or degrade performance >> by default. >> >> Case (2) yes! >> >> case (3), like a title or some text to plug in, we should escape by >> default, but add a parameter :html_escape == false for the cases the >> user >> wants to plug in HTML. >> >> OK? >> >> Pj. >> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ngoto at gen-info.osaka-u.ac.jp Fri Jan 15 17:19:12 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Sat, 16 Jan 2010 02:19:12 +0900 Subject: [BioRuby] SPTR problem In-Reply-To: References: Message-ID: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 12 Jan 2010 22:52:42 +1000 Ben Woodcroft wrote: > Hi, > > While parsing all the yeast UniProt txt files I came across a problem with > the gn parser - it was returning an array when I expected a hash. Looking at > the code the problem seems to be this when statement: > > when /Name=/,/ORFNames=/ > @data['GN'] = gn_uniprot_parser > else > @data['GN'] = gn_old_parser > end > > http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line: > > GN OrderedLocusNames=YMR084W; > > So GN line had OrderedLocusNames= but not Name= or ORFNames=, so it didn't > use the new parser, like the other entries I came across. Should all 4 > possibilities be tested for in the when statement: (Synonyms= being the > 4th)? It seems to be a bug. 
Perhaps there were no (or very few) entries
which only had OrderedLocusNames= when the code was first written
in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d.

The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004.
The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
|
| GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
| GN   ORFNames=<name1>[, <name2>...];
|
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.

You are right, all 4 possibilities should be considered.
"Synonyms" could be left out, because it can only appear together
with "Name", but it is safer to include it.

> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?

I don't know. Now, I think using the same names as described
in the original entries may be preferred, too.

> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).

Please use a profiler to find bottlenecks.

  % ruby -rprofile xxx.rb

> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.

What type of "concatenated" do you mean?
For simple concatenation, for example the original file distributed
from the UniProt FTP site, Bio::FlatFile can be used.

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(please gunzip before reading!)

  ff = Bio::FlatFile.open("uniprot_sprot.dat")
  ff.each do |e|
    puts e.entry_id
  end

>
> Thanks,
> ben

Thank you.
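The four-token check discussed above could be sketched roughly as below. This is only a minimal illustration, not the actual BioRuby fix: the constant GN_TOKENS and the helper new_gn_format? are hypothetical names, and the old-style line in the example is made up. In the parser itself the same idea would presumably mean listing all four tokens in the existing when statement.

```ruby
# The four tokens allowed on a GN line since UniProtKB release 2.0.
GN_TOKENS = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

# Hypothetical helper, not part of BioRuby: returns true when the GN
# text uses the token-based format, whichever token it happens to contain.
def new_gn_format?(gn_text)
  !!(gn_text =~ GN_TOKENS)
end

# The line from A2P2R3 that was sent to the old parser by mistake.
puts new_gn_format?('GN   OrderedLocusNames=YMR084W;')
# A made-up old-style line without any token.
puts new_gn_format?('GN   HNS OR DRDX.')
```

With a check like this, an entry that carries only OrderedLocusNames= would be routed to the new parser in the same way as one with Name= or ORFNames=.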
-- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From tomoakin at kenroku.kanazawa-u.ac.jp Sat Jan 16 05:36:02 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Sat, 16 Jan 2010 14:36:02 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100115140059.GA24948@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> Message-ID: <4B515042.7020204@kenroku.kanazawa-u.ac.jp> Hi, Pjotr Prins wrote: > On second thought, escaping is less obvious than I thought. I can > escape all generated HTML, but that leaves no way to customize the > output. Say I want to include an href in a sequence descriptor - which > is a fairly typical requirement - that would be disabled. I agree this. Having a link to original sequence on the name is usually good idea. > I am going to add a 'master' switch for escaping of HTML. The default > will be with escaping. How do you think to test if the object responds to to_html and then call to_html else pass to escapeHTML. The object may internally plain text and htmlized text or plain text plus link information or just the plain text but cares how is output as html inline element. If properly imlemented, it can generate a link from "gi|112233|..." within a text and cache for the converted result. The object can also simply pass the user supplied html. I think it is a predictable use that user supplied sequence be aligned with sequences obtained from databases. Isn't it better to be able to regard user supplied text as a simple text but the sequence from databases having proper link? This may not be simple with a master switch. 
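The idea in the message above (call to_html when the object provides it, otherwise escape the plain text) might look roughly like the sketch below. It is an illustration under stated assumptions, not code from any BioRuby branch: LinkedName, html_for and the example.org URL are all hypothetical, and only CGI.escapeHTML comes from the Ruby standard library.

```ruby
require 'cgi'

# Hypothetical name object: holds plain text, produces HTML on request.
class LinkedName
  # Placeholder URL prefix, not a real service.
  BASE_URL = 'http://example.org/gi/'

  def initialize(text)
    @text = text
  end

  def to_s
    @text
  end

  # Escape the text first, then wrap a leading "gi|NNN|" in a link.
  # The converted result is cached for repeated output.
  def to_html
    @html ||= CGI.escapeHTML(@text).sub(/\Agi\|(\d+)\|/) { |m|
      %(<a href="#{BASE_URL}#{$1}">#{m}</a>)
    }
  end
end

# Hypothetical dispatcher for an HTML formatter: objects that know
# about HTML are trusted, everything else is escaped as plain text.
def html_for(object)
  if object.respond_to?(:to_html)
    object.to_html
  else
    CGI.escapeHTML(object.to_s)
  end
end

puts html_for('a<b & c')
puts html_for(LinkedName.new('gi|112233|x<y'))
```

A user-supplied String takes the escaping branch, while a database-derived name wrapped in such an object gets its link, so both can sit in the same alignment without a global switch.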
From pjotr.public14 at thebird.nl Sat Jan 16 08:30:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 16 Jan 2010 09:30:41 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <4B515042.7020204@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> Message-ID: <20100116083041.GA2663@thebird.nl> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote: > > I am going to add a 'master' switch for escaping of HTML. The default > > will be with escaping. > > How do you think to test if the object responds to to_html > and then call to_html else pass to escapeHTML. In this case the object to convert to HTML is a String and part of Bio::Alignment. Later implementations of Bio::Alignment could use a Bio::Sequence.id (or something Naohisa wrote me). It would mean we would have to create a Bio::Sequence::Descriptor object, which would contain several specialistic 'output' generators. This is a recurrent idea we need to discuss. I think *all* HTML based stuff should be in its own objects - and its own tree (I have created bio/output/html for that purpose). I think it is a bad idea to clutter regular BioRuby code with HTML specific stuff. Likewise for other outputs, as you pointed out, like plotting. Output should live in bio/lib/output/html bio/lib/output/plot bio/lib/output/gtk bio/lib/output/rails (perhaps) (etc) that way display code never pollutes the simple Bio::Sequence object, for example. You'll get Bio::Html::Sequence for that - or my preferred naming Bio::HtmlSequence. 
Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the latter would not care - and we could adapt the HtmlSequence info to show embedded hrefs. That would be the proper way to handle it. No testing of methods (like to_html), but use the object structure to define what is supported (and not). Until we implement that (get Bio::Alignment to support arbitrary Sequence objects) I think the master switch is fine. I have updated my branch. Default behaviour is escaping. If a user (like me) wants it otherwise, it is allowed. Pj. From tomoakin at kenroku.kanazawa-u.ac.jp Sun Jan 17 05:12:35 2010 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Sun, 17 Jan 2010 14:12:35 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100116083041.GA2663@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> Message-ID: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> Hi, On 2010/01/16, at 17:30, Pjotr Prins wrote: > On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote: >>> I am going to add a 'master' switch for escaping of HTML. The >>> default >>> will be with escaping. >> >> How do you think to test if the object responds to to_html >> and then call to_html else pass to escapeHTML. > > In this case the object to convert to HTML is a String and part of > Bio::Alignment. Later implementations of Bio::Alignment could use a > Bio::Sequence.id (or something Naohisa wrote me). It would mean we > would have to create a Bio::Sequence::Descriptor object, which would > contain several specialistic 'output' generators. 

For the time being I don't expect such a sophisticated mechanism to
automatically generate proper HTML, but simply want to add a means to
distinguish what should be escaped as a matter of course from what is
specifically prepared as html by the user.

A user can write:

  class HTMLString < String
    def to_html
      self
    end
  end

  a = Bio::Alignment.new
  a.add_seq('ATCCATGG', HTMLString.new('a'))
  # this is html under the responsibility of the programmer

  a.add_seq('ATGCATGC', '')
  # this is not html; don't care on '<', or '>'

  simple = Bio::Html::HtmlAlignment.new(a,
             :title => HTMLString.new('A fancy HTML title'))
  html = simple.html()

If Bio::Alignment does not force the given object to be a String,
such code should be possible without any change in Bio::Alignment,
and only the HtmlAlignment class and the programmer need to know about it.
So, HTML specific code does not need to go into regular BioRuby code.

> That would be the proper way to handle it. No testing of methods
> (like to_html), but use the object structure to define what is
> supported (and not).

I'm not sure what you mean by "use the object structure".
How do you distinguish plain text from HTML text?
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2010/01/16, at 17:30, Pjotr Prins wrote:

> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
>>> I am going to add a 'master' switch for escaping of HTML. The
>>> default
>>> will be with escaping.
>>
>> How do you think to test if the object responds to to_html
>> and then call to_html else pass to escapeHTML.
>
> In this case the object to convert to HTML is a String and part of
> Bio::Alignment. Later implementations of Bio::Alignment could use a
> Bio::Sequence.id (or something Naohisa wrote me). It would mean we
> would have to create a Bio::Sequence::Descriptor object, which would
> contain several specialistic 'output' generators.
>
> This is a recurrent idea we need to discuss.
> > I think *all* HTML based stuff should be in its own objects - and its > own tree (I have created bio/output/html for that purpose). > > I think it is a bad idea to clutter regular BioRuby code with HTML > specific stuff. Likewise for other outputs, as you pointed out, like > plotting. Output should live in > > bio/lib/output/html > bio/lib/output/plot > bio/lib/output/gtk > bio/lib/output/rails (perhaps) > (etc) > > that way display code never pollutes the simple Bio::Sequence object, > for example. You'll get Bio::Html::Sequence for that - or my > preferred naming Bio::HtmlSequence. > > Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the > latter would not care - and we could adapt the HtmlSequence info to > show embedded hrefs. > > That would be the proper way to handle it. No testing of methods > (like to_html), but use the object structure to define what is > supported (and not). > > Until we implement that (get Bio::Alignment to support arbitrary > Sequence objects) I think the master switch is fine. I have updated > my branch. Default behaviour is escaping. If a user (like me) wants > it otherwise, it is allowed. > > Pj. > From pjotr.public14 at thebird.nl Sun Jan 17 13:54:41 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 17 Jan 2010 14:54:41 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> Message-ID: <20100117135441.GA24341@thebird.nl> Hi Tomoaki, Thanks for you responses. I really appreciate it. 

On Sun, Jan 17, 2010 at 02:12:35PM +0900, Tomoaki NISHIYAMA wrote:
> A user can write:
>
> class HTMLString < String
>   def to_html
>     self
>   end
> end
>
> a = Bio::Alignment.new
> a.add_seq('ATCCATGG', HTMLString.new('a'))

There is at least one 'problem' with this approach. This assumes that
Bio::Alignment will keep its current implementation. Currently
Bio::Alignment stores a list of descriptions, and a list of sequences.
As Naohisa wrote me two weeks ago, this dates from before Bio::Sequence
had its own identifier/descriptor. If we redesign Bio::Alignment there
is a large chance we will store Bio::Sequence instead of two lists (I,
for one, would certainly favour that).

The other problem is more about OOP. In your example you say once it
is an HTML object (HTMLString) and next you add a specific method for
html, 'to_html'. Twice it is 'told' that it generates HTML. 'to_html'
also implies something of a transformation. We should opt for a
different method name (generate_html, perhaps, or html)

  class HTMLString
    def html
    end
  end

The 'responsibility' of the output is with HTMLString. Good. This way
an implementation of Bio::Alignment does not need to know about HTML,
but still can generate the output, at the user's request.

> # this is html under the responsibility of the programmer
>
> a.add_seq('ATGCATGC', '')
> # this is not html; don't care on '<', or '>'
>
> simple = Bio::Html::HtmlAlignment.new(a,
>            :title => HTMLString.new('A fancy HTML title'))
> html = simple.html()
>
> If Bio::Alignment does not force the object given to be String,
> such code should be possible without the change in Bio::Alignment,
> and only the HtmlAlignment class and the programmer needs to know it.
> So, HTML specific code does not need go to regular BioRuby code.

HTMLAlignment should not care either how the HTML is generated. It is
really up to the container holding the sequence, or description, what
the output is.

What I don't like about the proposed approach is that HTMLAlignment
gets an object, needs to check for a 'to_html' or 'html' method (ugly),
and if it does not exist, needs to escape the information (by calling
the to_s method?). That is a lot of formal checking I need to do for
every output generated.

>> That would be the proper way to handle it. No testing of methods
>> (like to_html), but use the object structure to define what is
>> supported (and not).
>
> I'm not sure what do you mean by "use the object structure".
> How do you distinguish a plain text and HTML text?

The output is generated by an HTML aware container. We can agree to
use a single 'html' method. Create different types of objects:

  HTMLSequence.html        - generates formatted HTML
  ColorHTMLSequence.html   - generates formatted color HTML
  EscapedHTMLSequence.html - generates escaped native stuff

And if someone wanted it, he could create:

  Sequence.html            - generates plain text

This would prevent downstream 'checking' of object responsibilities.
We can assume the user knows he is going to use HTMLAlignment and
therefore we can expect him to pass in a known HTML supported Sequence
object.

The reason to get the responsibility in the right place is to create
code that is as clean as possible. You really don't want downstream
checking of methods.

We can further discuss in Japan. At least it is clear we have several
options.

Pj.

> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/01/16, at 17:30, Pjotr Prins wrote:
>
>> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
>>>> I am going to add a 'master' switch for escaping of HTML. The
>>>> default
>>>> will be with escaping.
>>>
>>> How do you think to test if the object responds to to_html
>>> and then call to_html else pass to escapeHTML.
>>
>> In this case the object to convert to HTML is a String and part of
>> Bio::Alignment.
Later implementations of Bio::Alignment could use a >> Bio::Sequence.id (or something Naohisa wrote me). It would mean we >> would have to create a Bio::Sequence::Descriptor object, which would >> contain several specialistic 'output' generators. >> >> This is a recurrent idea we need to discuss. >> >> I think *all* HTML based stuff should be in its own objects - and its >> own tree (I have created bio/output/html for that purpose). >> >> I think it is a bad idea to clutter regular BioRuby code with HTML >> specific stuff. Likewise for other outputs, as you pointed out, like >> plotting. Output should live in >> >> bio/lib/output/html >> bio/lib/output/plot >> bio/lib/output/gtk >> bio/lib/output/rails (perhaps) >> (etc) >> >> that way display code never pollutes the simple Bio::Sequence object, >> for example. You'll get Bio::Html::Sequence for that - or my >> preferred naming Bio::HtmlSequence. >> >> Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the >> latter would not care - and we could adapt the HtmlSequence info to >> show embedded hrefs. >> >> That would be the proper way to handle it. No testing of methods >> (like to_html), but use the object structure to define what is >> supported (and not). >> >> Until we implement that (get Bio::Alignment to support arbitrary >> Sequence objects) I think the master switch is fine. I have updated >> my branch. Default behaviour is escaping. If a user (like me) wants >> it otherwise, it is allowed. >> >> Pj. 
>> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From donttrustben at gmail.com Tue Jan 19 02:15:30 2010 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 19 Jan 2010 12:15:30 +1000 Subject: [BioRuby] SPTR problem In-Reply-To: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> References: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi, Thanks for the response. embedded. 2010/1/16 Naohisa GOTO > > It seems to be a bug. Perhaps there were no (or very few) entries > which only had OrderedLocusNames= when the code was first written > in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d. > I was figuring that. Also, since no actual exception was thrown, errors might not have been noticed. I wrote a patch for this that I've been using internally, but haven't included unit tests. http://github.com/wwood/bioruby/commit/b2f6cb0b Happy to write tests, but you seem to rewrite my patches anyway.. > > The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. > The document http://www.uniprot.org/docs/sp_news.htm says: > | The new format of the GN line is: > | > | GN Name=; Synonyms=[, ...]; > OrderedLocusNames=[, ...]; > | GN ORFNames=[, ...]; > | > | None of the above four tokens are mandatory. But a "Synonyms" token can > only be present if there is a "Name" token. > > You are right the 4 possibilities should be considered. > "Synonyms" can be eliminated, but it may be safe to be included. > > > Also, while I'm here: > > * why does the returned hash have different keys than are in the file? > e.g. > > ORFNames becomes :orfs? > > I don't know. Now, I think using the same names as described > in the original entries may be preferred, too. > What do you suggest we do about this? 
> > > * I also found the parsing process for whole genomes quite slow (multiple > > hours for well annotated ones). > > Please use profiler to find bottlenecks. > % ruby -rprofile xxx.rb > I tried to do something like that but in the end found it easier to pre-grep the uniprot file, keeping only the lines relevant to me. There was too many levels of indirection in my code for me to bother tracking it down. > > > * is there any standard way to handle concatenated UniProt files? I wrote > my > > own as it was simple. > > What type of "concatenated" do you mean? > For simple concatenation, for example, original file distributed > from UniProt FTP site, Bio::FlatFile can be used. > > ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz > (please gunzip before reading!) > > ff = Bio::FlatFile.open("uniprot_sprot.dat") > ff.each do |e| > puts e.entry_id > end > More evidence I'm an idiot. Like I needed any. Thanks, ben From pjotr.public14 at thebird.nl Tue Jan 19 10:50:56 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 19 Jan 2010 11:50:56 +0100 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100117135441.GA24341@thebird.nl> References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> Message-ID: <20100119105056.GA29525@thebird.nl> Based on Tomoaki's comments I propose the following: The requirements are: A- input objects that know about HTML should generate that B- other input files get escapeHTML(object.to_s) For a container/displayer to recognize object A, object A should 
have a method to_html:

  class ObjectA
    def to_html
    end
  end

If to_html does not exist to_s is called - and escaped. The principle
will go into a mixin for the container class.

Everyone OK with this?

Pj.

From ktym at hgc.jp Tue Jan 19 12:41:31 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 19 Jan 2010 21:41:31 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119105056.GA29525@thebird.nl>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl>
Message-ID: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>

Dear Pj and all,

I'm sorry that I could not spare enough time to follow this thread,
but I'd like to add some comments.

Firstly, I don't like the method name 'to_html', as we have already
deprecated 'to_fasta', because 'to_' is reserved for conversion
of the class in Ruby's convention (the above two methods just convert
String to String).

We (Nakao-san and I) are now working to improve our TogoWS service
(http://togows.dbcls.jp) by supporting RDF output. I hope to propose
a generalized way to achieve this (hopefully, before the BioHackathon
2010 http://hackathon3.dbcls.jp/).

Our current attempt is to have an 'output' method in the Bio::DB class,
with each sub-class implementing the actual 'output_*' methods for the
formats relevant to it.

# This kind of requirement may also be true for classes other than
# the Bio::DB (for example, Bio::Sequence, Alignment, Newick classes),
# so we may put this interface in the top level class (Bio::Root?),
# which does not exist for now, though.

In TogoWS, we internally use the BioRuby library, and the URI

  http://togows.dbcls.jp/entry/exampledb/1/definition

is sent to the 'definition' method defined in the Bio::ExampleDB class.

Similarly, we can map the '.' notation in the following URLs to a call
of the output method, using their suffix as a format specifier.

  http://togows.dbcls.jp/entry/exampledb/1.rdf
  http://togows.dbcls.jp/entry/exampledb/1.fasta

Therefore, these can be mapped to output(:rdf) and output(:fasta)
method calls to the Bio::ExampleDB class, respectively.

All we need to do is to add these methods in every database class
comprehensively.

I think this is simple enough and beautiful.
I'll attach some primitive pseudocode below.
Comments are welcome.

Regards,
Toshiaki Katayama


  module Bio
    class DB
      def output(format)
        send("output_#{format.to_s.downcase}")
      end
    end
  end

  module Bio
    class ExampleDB < DB

      # output sequence of the entry in FASTA format
      def output_fasta
        ">#{@entry_id} #{@definition}\n#{@sequence}\n"
      end

      # output contents of the entry in RDF (N3) format
      def output_rdf
        prefix_subject = "http://togows.dbcls.jp/entry/exampledb"
        prefix_predicate = "http://togows.dbcls.jp/ontology/exampledb"
        "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#definition>\t#{@definition} .\n" +
        "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#sequence>\t#{@sequence} .\n"
      end

      # output contents of the entry in HTML format
      def output_html
        "#{@entry_id} ... blah, blah, blah ..."
      end
    end
  end

  entry = Bio::ExampleDB.new(str)

  entry.output(:fasta)
  # =>
  # >ENTRY_ID
  # atgcatgcatgcatgcatgc

  entry.output(:rdf)
  # =>
  # "DEFINITION" .
  # "atgcatgcatgcatgc" .


On 2010/01/19, at 19:50, Pjotr Prins wrote:

> Based on Tomoaki's comments I propose the following:
>
> The requirements are:
>
> A- input objects that know about HTML should generate that
> B- other input files get escapeHTML(object.to_s)
>
> For a container/displayer to recognize object A, object A should have
> a method to_html:
>
>   class ObjectA
>     def to_html
>     end
>   end
>
> If to_html does not exist to_s is called - and escaped. The principle
> will go into a mixin for the container class.
>
> Everyone OK with this?
>
> Pj.

From tomoakin at kenroku.kanazawa-u.ac.jp Tue Jan 19 14:05:17 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 19 Jan 2010 23:05:17 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID:

Hi,

> Firstly, I don't like the method name 'to_html', as we have already
> deprecated 'to_fasta', because 'to_' is reserved for conversion
> of the class in Ruby's convention (the above two methods just convert
> String to String).

I think HTML and String should actually be different classes.
There are to_i and to_f for conversion between subclasses of Numeric,
and these are not rejected just because the conversion is Numeric to Numeric.

a string "<a href=example.com> aaa</a>" in HTML is "&lt;a href=example.com&gt; aaa&lt;/a&gt;"
but HTML "<a href=example.com> aaa</a>" in HTML is "<a href=example.com> aaa</a>"

The return value of to_html should be of a different class than String.

So, the point is

> def output_html
>   "#{@entry_id} ... blah, blah, blah ..."
> end

how to regulate the different behavior of @entry_id.
If the nature of entry_id is plain text, it should be escaped.
On the other hand, sometimes the user may want to use an html aware
object for whatever purpose (color, link, etc...).
When we want to mix them with data supplied from outside,
say user input into CGI, those data should usually be treated
as plain text, so that they cannot interfere with the html.

  #!/usr/local/bin/ruby
  require 'bio'
  require 'cgi'

  class Bio::HTMLString < String
    def to_html
      self
    end
  end

  def Bio::generate_html(object)
    if object.respond_to?(:to_html)
      object.to_html
    else
      string = CGI.escapeHTML(object.to_s) # fall back to escaping
      Bio::HTMLString.new(string)
    end
  end

  p Bio::generate_html(12)
  p Bio::generate_html(Bio::HTMLString.new('<a href=example.com> aaa</a>'))
  p Bio::generate_html('<a href=example.com> aaa</a>')

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From pjotr.public14 at thebird.nl Tue Jan 19 14:34:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 15:34:22 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID: <20100119143422.GA1781@thebird.nl>

On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote:
> All we need to do is to add these methods in every database class
> comprehensively.
>
> I think this is simple enough and beautiful.
> I'll attach some primitive pseudocode below.
> Comments are welcome.
I agree with Tomoaki it is too restrictive. What, indeed, if we want
to present the HTML in a different way?

The second comment is that I dislike the way the current files like
sequence.rb and alignment.rb are mushrooming in size. There is much
too much in there, which discourages people from diving in. I believe
code should be readable, and easy to understand/digest.

Sticking in output 'details', like HTML generation, does not help.

I really would like all HTML to be in one sub-tree. Also XML, RDF and
whatnot. When it is 'business' logic it should be in database. When it
is output transformations it is not 'business' logic any longer.

Don't you think the Sequence, or KEGG, object should not care about
HTML? Or RDF, or plotting? Those are separate functionalities. They
share common access patterns - which are part of the DB class.

Finally, why not use method names? What is the added value of

  output(:html)

over

  output_html

Pj.

From ktym at hgc.jp Tue Jan 19 15:33:30 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 00:33:30 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To:
References: <20100112101132.GC10308@thebird.nl> <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp> <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID:

Nishiyama-san,

I couldn't catch what you are trying to do...
(maybe because I didn't read throughout the thread) On 2010/01/19, at 23:05, Tomoaki NISHIYAMA wrote: > Hi, > >> Firstly, I don't like to use the method name 'to_html' as we already >> deprecated to use 'to_fasta' because 'to_' is reserved for conversion >> of the class in Ruby's convention (above two methods just convert >> String to String). > > I think HTML and String should be actually a different class. > There are to_i and to_f for conversion between subclasses of Numeric, > yet this isn't denied because the conversion is Numeric to Numeric. > > a string " aaa" in HTML is > "<a href=example.com> aaa</a>" but > HTML " aaa" in HTML is " aaa" > > The return value of to_html should be a different class than String. If the method is named as to_html, it might return a HTML object. But, from my view point, a html string is still just a text and escaping the html string is responsibility of a programmer depending on where the string will be used. > > So, the point is >> def output_html >> "

#{@entry_id}

... blah, blah, blah ..."
>> end
>
> how to regulate the different behavior of @entry_id.
> If the nature of entry_id is plain text, that should be escaped.
> On the other hand sometimes the user may want to use html aware
> object for whatever purpose (color, link, etc...).
> When we want to mix them with data supplied
> from outside, say user input into CGI, those data shall usually
> be treated as plain text and suppress any interference with html.

I'm talking about a database class, and the content of @entry_id is
a string parsed from a flat file entry of that database (it does not
come from outside).

>
> #!/usr/local/bin/ruby
> require 'bio'
> require 'cgi'
>
> class Bio::HTMLString < String
>   def to_html
>     self
>   end
> end
> def Bio::generate_html(object)
>   if object.respond_to?(:to_html)
>     object.to_html
>   else
>     string = CGI.escapeHTML(object.to_s) #fall back to escaping
>     Bio::HTMLString.new(string)
>   end
> end
>
> p Bio::generate_html(12)
> p Bio::generate_html(Bio::HTMLString.new(' aaa'))
> p Bio::generate_html(' aaa')

Why do we need to have this functionality under the Bio name space?
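
For readers who want to try the idea quoted above, here is a minimal
runnable sketch of the to_html fallback kept outside the Bio name space.
The names HTMLString and HTMLSupport are hypothetical and chosen only for
this illustration; they are not part of BioRuby. The only library call
assumed is CGI.escapeHTML from Ruby's standard library.

```ruby
require 'cgi'

# Hypothetical marker class: a String whose content is already HTML.
class HTMLString < String
  def to_html
    self
  end
end

# Hypothetical helper module, deliberately not placed under Bio.
module HTMLSupport
  module_function

  # Objects that respond to to_html are trusted as HTML;
  # everything else is converted with to_s and escaped.
  def generate_html(object)
    if object.respond_to?(:to_html)
      object.to_html
    else
      HTMLString.new(CGI.escapeHTML(object.to_s))
    end
  end
end

plain  = HTMLSupport.generate_html('<b>aaa</b>')
marked = HTMLSupport.generate_html(HTMLString.new('<b>aaa</b>'))
number = HTMLSupport.generate_html(12)

puts plain   # &lt;b&gt;aaa&lt;/b&gt;
puts marked  # <b>aaa</b>
puts number  # 12
```

Because the escaped result is itself an HTMLString, passing it through
generate_html a second time does not escape it again.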
Toshiaki > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > From ktym at hgc.jp Tue Jan 19 16:21:54 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Wed, 20 Jan 2010 01:21:54 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100119143422.GA1781@thebird.nl> References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp> <20100113073706.GA25611@thebird.nl> <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> Message-ID: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> Dear Pj, On 2010/01/19, at 23:34, Pjotr Prins wrote: > On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote: >> All we need to do is to add these methods in every database class >> comprehensively. >> >> I think this is simple enough and beautiful. >> I'll attach a primitive pseudo code in below. >> Comments are welcome. > > I agree with Tomoaki it is too restrictive. What, indeed, if we want > to present the HTML in a different way? Hmm. Could you provide me some use cases? Override the output_html method, or, use some template engine to be more generic. > > The second comment is that I dislike the way the current files like > sequence.rb and alignment.rb are mushrooming in size. There is much > too much in there, which discourages people from diving in. I believe > code should be readable, and easy to understand/digest. I can agree some files became too large to learn and/or maintain. But if we try to change the structure of current code base, we need to define a clean criteria beforehand. 
If we separate files into sub files, people then need to look around the number of files, and it may also slow down the loading speed of the bioruby library. It is a problem of balance. In both cases, lack of excellent guide to read through the bioruby library might be a essential issue. > > Sticking in output 'details', like HTML generation, does not help. > > I really would like all HTML to be in one sub-tree. Also XML, RDF and > whatnot. When it is 'business' logic it should be in database. When it > is output transformations it is not 'business' logic any longer. I'm not sure about HTML but FASTA and RDF, for example, are tightly related to the original database format/contents. So, I proposed to have methods to generate formatted string in each database class. There can be many ways to design OO class trees and to find the best way to represent/abstract things is always a difficult task. At some time, we may do refactoring to produce BioRuby 2.0. Before doing that, we can discuss how to sit all classes/codes cleanly. We may need someone who understand entire structure/contents of the current codebase and willing to design a better one with a good sense. > > Don't you think the Sequence, or KEGG, object should not care about > HTML? Or RDF, or plotting? Those are separate functionalities. They > share common access patterns - which are part of the DB class. Again, we can take both approach. My current proposal is conservative one. Just add these functionalities in each class as the class knows what is in it and what is the best way to represent the contents. If we separate formatting/plotting functionalities into separate class, which might be something like Bio::FlatFile class who knows the header line format of every database entries. Or we may design better one. Anyway, I'm now listening. So, please don't stick with HTML things only and think a global design to which we can plan to migrate. > > Finally, why not use method names? 
> What is the added value of
>
>   output(:html)
>
> over
>
>   output_html
>
> Pj.

Maybe from an aesthetic viewpoint?

I think it looks better, and we can easily switch the output format
depending on the context without modifying the code.
I have something like the @media property in CSS (screen, print etc.) in mind.

  if used_for_semantic_web?
    format = :rdf
    # add some code to do the preparation job for SW
  elsif used_for_blast?
    format = :fasta
    # add some code to do the preparation job for blast
  end

  # we don't need to change the following line in any context
  entry.output(format)

Toshiaki

From pjotr.public14 at thebird.nl Tue Jan 19 20:52:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 21:52:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100119205241.GA7043@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I agree with Tomoaki it is too restrictive. What, indeed, if we want
> > to present the HTML in a different way?
>
> Hmm. Could you provide me some use cases?

Think of URLs. One user wants to point a gene ID to NCBI. Another
to Swissprot. The container cannot be aware of all exceptions - and
really should not handle it.

> Override the output_html method, or, use some template engine to be
> more generic.

Maybe those are good mechanisms. In the pre-hackathon we should
discuss these points.

> I can agree some files became too large to learn and/or maintain.
> But if we try to change the structure of current code base, > we need to define a clean criteria beforehand. Yes. > If we separate files into sub files, people then need to look around > the number of files, and it may also slow down the loading speed of > the bioruby library. It is a problem of balance. > > In both cases, lack of excellent guide to read through the bioruby > library might be a essential issue. I think if we structure the files and modules well - and make them small enough - they become self-explaining. That would be my ultimate goal. > At some time, we may do refactoring to produce BioRuby 2.0. > Before doing that, we can discuss how to sit all classes/codes cleanly. > We may need someone who understand entire structure/contents of > the current codebase and willing to design a better one with a good sense. Yes. I agree it is a big step. But we should go for this type of challenge. > > Don't you think the Sequence, or KEGG, object should not care about > > HTML? Or RDF, or plotting? Those are separate functionalities. They > > share common access patterns - which are part of the DB class. > > Again, we can take both approach. My current proposal is conservative one. > Just add these functionalities in each class as the class knows what is in it > and what is the best way to represent the contents. > > If we separate formatting/plotting functionalities into separate class, > which might be something like Bio::FlatFile class who knows the header > line format of every database entries. Or we may design better one. FlatFile has some downsides. It has complicated the libraries. Complication means the modules are less easy to adapt/modify. I think it is slightly over-engineered. Maybe not enough of a problem to take it out, but I hope you see where I am coming from. > Anyway, I'm now listening. So, please don't stick with HTML things only > and think a global design to which we can plan to migrate. I have to spend a day on a writeup. 
In the coming two weeks. I will try to explain my ideas. > Maybe from esthetics viewpoint? > > I think it looks better, and, we can easily switch the output format > depending on the context without modifying the code. > Something like a @media property in CSS (screen, print etc.) in mind. > > if used_for_semantic_web? > format = :rdf > # add some codes to do preparation job for SW > elsif used_for_blast? > format = :fasta > # add some codes to do preparation job for blast > end > > # we don't need to change the following line in any context > entry.output(format) I see your point. The criticism is that it obfuscates the real intention of the code - i.e. it is not self documenting any longer. But, I guess, this boils down to preferences and acquired tastes. It is not obvious to a newbie, though it may be obvious for someone who is accustomed to Bioruby internals. Which may be good - depending on our basic values. Pj. From ktym at hgc.jp Wed Jan 20 00:49:37 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Wed, 20 Jan 2010 09:49:37 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100119205241.GA7043@thebird.nl> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> <20100119205241.GA7043@thebird.nl> Message-ID: Dear Pj, On 2010/01/20, at 5:52, Pjotr Prins wrote: > Dear Toshiaki, > > On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: >>> I agree with Tomoaki it is too restrictive. What, indeed, if we want >>> to present the HTML in a different way? >> >> Hmm. Could you provide me some use cases? > > Think of URL's. 
One user wants to point a gene ID to NCBI. Another > to Swissprot. The container can not be aware of all exceptions - and > really should not handle it. Still not clear to me. I supposed to generate a URL string for the href attribute of . However, is there any IDs which needs to be escaped? Or do you mean to embed a HTML snippet in URL? If so, we may need to use URL encoding (URI.escape) instead of the HTML escaping (CGI.escapeHTML). > >> Override the output_html method, or, use some template engine to be >> more generic. > > Maybe those are good mechanisms. In the pre-hackathon we should > discuss these points. Is there any better replacement for Ruby's CGI library available? Requirements: - separation of the HTML from CGI CGI.escapeHTML looks ugly in terms of the naming convention (CamelCase) and the name space -- why not HTML.escape(string). Moreover, we don't want to require 'cgi' just for escaping a HTML string. - support for templates (separation of logic and presentation) I had used erb and html-template. Sometimes erb is too slow (especially when it contains a nested loop to generate a number of lists or tables). - bundled with Ruby as a standard library Otherwise, we'd better to use Rails as a default environment (from a viewpoint of popularity). > >> I can agree some files became too large to learn and/or maintain. >> But if we try to change the structure of current code base, >> we need to define a clean criteria beforehand. > > Yes. > >> If we separate files into sub files, people then need to look around >> the number of files, and it may also slow down the loading speed of >> the bioruby library. It is a problem of balance. >> >> In both cases, lack of excellent guide to read through the bioruby >> library might be a essential issue. > > I think if we structure the files and modules well - and make them > small enough - they become self-explaining. That would be my ultimate > goal. > >> At some time, we may do refactoring to produce BioRuby 2.0. 
>> Before doing that, we can discuss how to sit all classes/codes cleanly. >> We may need someone who understand entire structure/contents of >> the current codebase and willing to design a better one with a good sense. > > Yes. I agree it is a big step. But we should go for this type of > challenge. > >>> Don't you think the Sequence, or KEGG, object should not care about >>> HTML? Or RDF, or plotting? Those are separate functionalities. They >>> share common access patterns - which are part of the DB class. >> >> Again, we can take both approach. My current proposal is conservative one. >> Just add these functionalities in each class as the class knows what is in it >> and what is the best way to represent the contents. >> >> If we separate formatting/plotting functionalities into separate class, >> which might be something like Bio::FlatFile class who knows the header >> line format of every database entries. Or we may design better one. > > FlatFile has some downsides. It has complicated the libraries. > Complication means the modules are less easy to adapt/modify. I think > it is slightly over-engineered. Maybe not enough of a problem to take > it out, but I hope you see where I am coming from. > >> Anyway, I'm now listening. So, please don't stick with HTML things only >> and think a global design to which we can plan to migrate. > > I have to spend a day on a writeup. In the coming two weeks. I will > try to explain my ideas. OK, let's discuss about these topics as well, during the pre-hackathon meeting (7th Feb) in Tokyo with other core developers. > >> Maybe from esthetics viewpoint? >> >> I think it looks better, and, we can easily switch the output format >> depending on the context without modifying the code. >> Something like a @media property in CSS (screen, print etc.) in mind. >> >> if used_for_semantic_web? >> format = :rdf >> # add some codes to do preparation job for SW >> elsif used_for_blast? 
>>   format = :fasta
>>   # add some codes to do preparation job for blast
>> end
>>
>> # we don't need to change the following line in any context
>> entry.output(format)
>
> I see your point. The criticism is that it obfuscates the real
> intention of the code - i.e. it is not self documenting any longer.
> But, I guess, this boils down to preferences and acquired tastes. It
> is not obvious to a newbie, though it may be obvious for someone who
> is accustomed to Bioruby internals. Which may be good - depending on
> our basic values.
>
> Pj.

Note that you can still directly use the output_html method in each
database class. The output(format) method is provided just as an abstract
interface, which will be useful in the above situation, for example.

Therefore, both of the following cases should return the same result, and
you can choose the coding style depending on the situation.

  # case 1
  format = :rdf
  entry.output(format)

  # case 2
  entry.output_rdf

You can also check entry.respond_to?(:output_rdf) in both cases.

Toshiaki

From pjotr.public14 at thebird.nl Wed Jan 20 07:36:44 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 20 Jan 2010 08:36:44 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100120073644.GA11295@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I really would like all HTML to be in one sub-tree. Also XML, RDF and
> > whatnot. When it is 'business' logic it should be in database. When it
> > is output transformations it is not 'business' logic any longer.
>
> I'm not sure about HTML but FASTA and RDF, for example, are tightly
> related to the original database format/contents. So, I proposed
> to have methods to generate formatted string in each database class.
>
> There can be many ways to design OO class trees and to find the best
> way to represent/abstract things is always a difficult task.

I wrote a nice alignment HTML output generator. Which also displays PAML
output. Currently it is in bio/output/html/htmlalignment.rb and the
class is named Bio::Html::Alignment.

For the current Bioruby, where do you want to put that? I don't feel
it should be cluttering alignment.rb. Naohisa has suggested
bio/alignment/output/html/alignment.rb instead. I feel uncomfortable
with this. But it is kinda consistent with above, tightly relating it
to the alignment object.

What do you think of the class name?

The code is in my color-alignment branch, see

http://github.com/pjotrp/bioruby/tree/color-alignment

Is anyone else interested in this type of discussion? We can take it
off-list.

Pj.

From missy at be.to Wed Jan 20 09:17:50 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Wed, 20 Jan 2010 18:17:50 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat)
Message-ID: <4B56CA3E.8000905@be.to>

Hi all,

I am using BioRuby 1.4.0 and have trouble handling the FASTA.QUAL
format using Bio::FastaNumericFormat.

Please see the following code:
========================
require 'rubygems'
require 'bio'

FASTA_QUAL =<<'EOS'
>SAMPLE1
30 30 29 42
EOS

qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
bs = qual.to_biosequence
puts bs.output(:raw)
=========================

The last line raises an error:

=========================
(eval):2:in `__get__seq': undefined method `seq' for # (NoMethodError)
from (eval):4:in `seq'
from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in `output'
from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in `output'
from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in `output'
from fasta_numeric_format.rb:11
=========================

Using :fasta, :fasta_numeric etc. in the last line gives the same result.

Please let me know if you have ideas to solve this problem.

Hiro.
--
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From andrew.j.grimm at gmail.com Wed Jan 20 12:09:19 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 20 Jan 2010 23:09:19 +1100
Subject: [BioRuby] Thread-safety of alignment
Message-ID:

Is alignment intended to be thread-safe in bioruby? If so, should I
use the same alignment factory between threads, or a separate one in
each thread?

Andrew

From ngoto at gen-info.osaka-u.ac.jp Wed Jan 20 13:36:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:36:29 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat)
In-Reply-To: <4B56CA3E.8000905@be.to>
References: <4B56CA3E.8000905@be.to>
Message-ID: <20100120133630.052BF1CBC433@idnmail.gen-info.osaka-u.ac.jp>

Hi,

This is a bug, and will be fixed.
Indeed, Bio::FastaNumericFormat does not contain a sequence, and
I forgot to take care of calls to to_biosequence.

For a workaround,

  qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
  bs = Bio::Sequence.new('')
  bs.quality_scores = qual.data
  puts bs.output(:fasta_numeric)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 20 Jan 2010 18:17:50 +0900
"MISHIMA, Hiroyuki" wrote:

> Hi all,
>
> I am using BioRuby 1.4.0., and have a trouble in handling the FASTA.QUAL
> format using Bio::FastaNumericFormat.
>
> Please see the following code:
> ========================
> require 'rubygems'
> require 'bio'
>
> FASTA_QUAL =<<'EOS'
> >SAMPLE1
> 30 30 29 42
> EOS
>
> qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
> bs = qual.to_biosequence
> puts bs.output(:raw)
> =========================
>
> The last line raise an error:
>
> =========================
> (eval):2:in `__get__seq': undefined method `seq' for
> # (NoMethodError)
> from (eval):4:in `seq'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in
> `output'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in
> `output'
> from
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in
> `output'
> from fasta_numeric_format.rb:11
> =========================
>
> In the last line, using :fasta, :fasta_numeric etc. make same results.
>
> Please let me know if you have ideas to solve this problem.
>
> Hiro.
> --
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Jan 20 13:50:45 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:50:45 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To:
References:
Message-ID: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Wed, 20 Jan 2010 23:09:19 +1100
Andrew Grimm wrote:

> Is alignment intended to be thread-safe in bioruby? If so, should I
> use the same alignment factory between threads, or a separate one in
> each thread?

It is not confirmed to be thread-safe, so it is safer to use a
separate one in each thread.

Currently, in BioRuby, manipulating the same object from different
threads is not intended. When manipulating the same object from
different threads is needed, using a mutex is recommended.

For library developers, writing thread-safe code is encouraged
where possible, but it is not mandatory.
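
The mutex recommendation can be sketched as follows. This is only an
illustration under stated assumptions: a plain Array stands in for
whatever shared, non-thread-safe object is used (for example an
alignment factory), so the sketch runs without BioRuby. Mutex and
Thread are from Ruby's standard library; the variable names are
hypothetical.

```ruby
require 'thread'

# Stand-in for a shared object that is not known to be thread-safe.
shared_results = []
lock = Mutex.new

threads = (1..4).map do |i|
  Thread.new do
    # Work on thread-local data needs no lock.
    value = i * i
    # Only the access to the shared object is serialized.
    lock.synchronize { shared_results << value }
  end
end
threads.each { |t| t.join }

puts shared_results.sort.inspect
```

The simpler alternative mentioned above, one separate object per
thread, avoids the lock entirely and is preferable when the objects
are cheap to create.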
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Andrew > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ktym at hgc.jp Thu Jan 21 14:05:42 2010 From: ktym at hgc.jp (Toshiaki Katayama) Date: Thu, 21 Jan 2010 23:05:42 +0900 Subject: [BioRuby] Bioruby HTML output In-Reply-To: <20100120073644.GA11295@thebird.nl> References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp> <20100115140059.GA24948@thebird.nl> <4B515042.7020204@kenroku.kanazawa-u.ac.jp> <20100116083041.GA2663@thebird.nl> <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp> <20100117135441.GA24341@thebird.nl> <20100119105056.GA29525@thebird.nl> <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp> <20100119143422.GA1781@thebird.nl> <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp> <20100120073644.GA11295@thebird.nl> Message-ID: <7B739736-1D0D-43E2-89E8-8F6B4DCC3404@hgc.jp> Dear Pj, I looked your code and had a feeling that we should use some template system. If HTML tags are hard coded in the library as you did, it will be very hard to modify them by the user. Besides, what version of the HTML specification did you have in mind? This is my first time to see the

tag is used in the form of

. Is it valid?

I also think decorations should be separated into the CSS layer and
you should avoid using the tag, especially when you are trying to
distribute your code as a part of the library.

As for the file location, I still like the way Naohisa has suggested.
Although, I'm not sure the internal node 'output/html' is necessary for
'bio/alignment/output/html/alignment.rb'.

Anyway, we need to try every approach to learn the pros and cons.

With your proposal, we may have a tree like this:

--------------------------------------------------
for bio/alignment.rb and bio/db/kegg/compound.rb and bio/db/genbank.rb ...

bio/output/html/html_alignment.rb (Bio::Html::Alignment)
bio/output/html/html_kegg_compound.rb (Bio::Html::KEGG::COMPOUND)
bio/output/html/html_genbank.rb (Bio::Html::GenBank)
  :
bio/output/rdf/rdf_kegg_compound.rb (Bio::RDF::KEGG::COMPOUND)
bio/output/rdf/rdf_genbank.rb (Bio::RDF::GenBank)
  :
bio/output/fasta/fasta_genbank.rb (Bio::FASTA::GenBank)
bio/output/fasta/fasta_kegg_genes.rb (Bio::FASTA::KEGG::GENES)
  :
bio/output/gff/gff_genbank.rb (Bio::GFF::GenBank)
  :
--------------------------------------------------

Apparently, the class names for output formats conflict with existing
classes (e.g. Bio::FASTA, Bio::GFF), and we need to look into each
subdirectory to find which output formats are supported for a
particular database.

If we gather templates of output formats along with the database classes:

--------------------------------------------------
for bio/alignment.rb:
  bio/alignment/alignment.html.erb
    :
for bio/db/kegg/compound.rb:
  bio/db/kegg/compound/compound.rdf.erb
  bio/db/kegg/compound/compound.tut.erb
  bio/db/kegg/compound/compound.html.erb
    :
for bio/db/genbank.rb:
  bio/db/genbank/genbank.rdf.erb
  bio/db/genbank/genbank.gff.erb
  bio/db/genbank/genbank.html.erb
  bio/db/genbank/genbank.fasta.erb
    :
--------------------------------------------------

However, this is still a desk plan and we need to try more
(we already started for RDF).
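
To make the template idea a little more concrete, here is a minimal
sketch using ERB from Ruby's standard library. It is not BioRuby code:
the class ExampleEntry, its fields, and the inline template are
hypothetical, and in the layout above the template text would live in
a separate *.html.erb file that the user can edit.

```ruby
require 'erb'

# Hypothetical entry class standing in for a database class.
class ExampleEntry
  include ERB::Util   # provides h() for HTML escaping

  attr_reader :entry_id, :definition

  # Inline here for brevity; assumed to be loaded from a template
  # file in a real layout.
  HTML_TEMPLATE = <<'EOS'
<div class="entry">
  <h1><%= h(entry_id) %></h1>
  <p><%= h(definition) %></p>
</div>
EOS

  def initialize(entry_id, definition)
    @entry_id = entry_id
    @definition = definition
  end

  # A caller can pass its own template to change the presentation
  # without touching this class.
  def output_html(template = HTML_TEMPLATE)
    ERB.new(template).result(binding)
  end
end

entry = ExampleEntry.new('C00001', 'H2O & friends')
puts entry.output_html
```

Presentation (tags, CSS classes) stays in the template, while the
class only exposes its data; values are escaped at the point where
they are placed into the HTML.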
Toshiaki On 2010/01/20, at 16:36, Pjotr Prins wrote: > Dear Toshiaki, > > On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote: >>> I really would like all HTML to be in one sub-tree. Also XML, RDF and >>> whatnot. When it is 'business' logic it should be in database. When it >>> is output transformations it is not 'business' logic any longer. >> >> I'm not sure about HTML but FASTA and RDF, for example, are tightly >> related to the original database format/contents. So, I proposed >> to have methods to generate formatted string in each database class. >> >> There can be many ways to design OO class trees and to find the best >> way to represent/abstract things is always a difficult task. > > I wrote a nice alignment HTML output generator. Which also displays PAML > output. Currently it is in bio/output/html/htmlalignment.rb and the > class is named Bio::Html::Alignment. > > For the current Bioruby, where do you want to put that? I don't feel > it should be cluttering alignment.rb. Naohisa has suggested > bio/alignment/output/html/alignment.rb instead. I feel uncomfortable > with this. But it is kinda consistent with above, tightly relating it > to the alignment object. > > What do you think of the class name? > > The code is in my color-alignment branch, see > > http://github.com/pjotrp/bioruby/tree/color-alignment > > Is anyone else interested in this type of discussion? We can take it > off-list. > > Pj. From pjotr.public14 at thebird.nl Thu Jan 21 16:20:49 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 21 Jan 2010 17:20:49 +0100 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) Message-ID: <20100121162049.GB31462@thebird.nl> Dear Toshiaki, On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote: > I looked your code and had a feeling that we should use some > template system. If HTML tags are hard coded in the library as you > did, it will be very hard to modify them by the user. 
Aren't we trying to overcomplicate things? This is an HTML generator - in fact it is embedded HTML, as I don't provide the <html>, header or body parts. It can just be inserted into Rails, or whatever HTML framework is out there.

Templating is just another abstraction. I don't intend to template engines like Rails.

Or are you merely referring to using the CGI class (or something like that)? I guess I could do that, though I have trouble seeing the benefits. It is just another way of writing HTML statements.

> Besides, what version of the HTML specification did you have in mind?
> This is my first time to see the tag is used in the form of . Is it valid?

Yes. It is, in fact, XHTML.

> I also think decorations should be separated to the CSS layer and you should avoid to use the <font> tag, especially when you are trying to distribute your code as a part of the library.

We use hard coded colors. I could use CSS, but then you need to provide a CSS file (or I need to hard code the header of the file). That makes it (again) more complicated than necessary. Where do we store the CSS file, and how do we make sure the browser finds it? CSS is really for adapting look and feel. If the output is meant to be fixed, why make it flexible? Besides, all (future) browsers support the font tag, as used. If that stops we could always adapt the source code.

> As for the file location, I still like the way Naohisa has
> suggested.

Alright. I can move the files, if that was all.

However, my colored alignment is not going to make it into Bioruby this way. There is always something wrong with my code, it appears. Now I need to move files to locations that have not really been decided on; I need to template HTML - but we haven't decided how, and it is questionable; I need to use CSS, though I think it makes things worse for users.

Are we really sure you want to reject this code just because it does not live up to everyone's current and future expectations? It may still be useful to someone else, you know, it does not break anything else, and can be improved in the future. Once we decide what we want to achieve.

The same really holds for my PAML branch and my GEO branch. Both contain useful utilities for others to use. And now the alignment is the third pending Bioruby branch.

Can you imagine my growing frustration? Should this go into Bioruby, or should I start another project, like others have done? Or stick it into my existing biotools or bigbio projects? Just so I don't have the hassle?

The way the Perl people handle it is by having independent modules. Everyone owns his, or her, own module and Perl's CPAN acts more as an aggregator.
The advantage is that the environment is more dynamic. And you really don't care what is inside a module. That is up to the maintainer and his/her users.

We could create independent BioRuby modules, which have their own git repositories. When a module is nice enough to include in Bioruby, make it a git submodule - I use this technique for biolib - and it will register in the BioRuby repository. That way Bioruby still controls what goes in a release. However, modules can be maintained for experimental setups or private use. So my modules would go in

  lib/bio/modules/paml
  lib/bio/modules/geo
  lib/bio/modules/htmlalignment

each its own git repository.

When one of those is 'strong' enough for main line you move it into a different location in the main repository. Modules could even be included in Bioruby releases.

What hurts me now is that no one is going to use my code, since I don't have the time to make it perfect, and it is hidden in my experimental Bioruby branches. We should find a way to make 'experimental code' available to the rest of the community. That way we may also 'recruit' help to make the code more perfect.

Make it easy to allow external modules to become visible through Bioruby - that is a win-win, as well as a more bazaar-like approach to OSS development.

I wonder how many people on this list would contribute code if it was more loosely organised.

Pj.

From ktym at hgc.jp Thu Jan 21 17:54:24 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Fri, 22 Jan 2010 02:54:24 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100121162049.GB31462@thebird.nl>
References: <20100121162049.GB31462@thebird.nl>
Message-ID:

Dear Pj,

I can understand your frustration and I like your idea of the 'module' system, as it reminds me of the way the Linux kernel tree is successfully maintained.

> I wonder how many people on this list would contribute code if it was
> more loosely organised.

Indeed.
However, I think our move from CVS to git was already a great step, in that it opened up a large opportunity for all those who want to participate in development. Before that, an "open source" project did not always mean an "open to join" project.

Now everyone can easily fork the project and release their modified code, as you have already done. So we may be able to judge from the current situation how many other people have tried.

Anyway, who decides, and how to decide, when to migrate contributed code into the main tree is still a difficult problem. It might sound like an excuse, but I'm also suffering from this difficulty. I also have several modules which are not yet contributed to the main tree, for example my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/sge/), because I'm not sure it is general enough or where it fits.

As for the HTML portion, I see your point.

* I'd like to hear comments from others.
* How do people like to render/visualize BioRuby objects (especially in HTML)?
* I didn't mean to use the CGI class for HTML generation (I don't even like it).
* The use of seems invalid in XHTML. See http://www.w3.org/TR/xhtml1/#C_3

P.S.
I once developed a mechanism, called plugins, to integrate end-user code snippets into the BioRuby shell. I wrote some plugins which render a colored codon table, a formatted summary of sequence properties, etc.

If those, and the functions defined in your plugins, could be easily accessed by

  puts Bio.your_function_name(options)

or something like that, would that satisfy your needs?

If so, we can consider making a repository for such plugins and bundling them with BioRuby as well.

Regards,
Toshiaki Katayama

From yannick.wurm at unil.ch Thu Jan 21 18:21:40 2010
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 21 Jan 2010 19:21:40 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To:
References:
Message-ID:

On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote:

> Are we really sure you want to reject this code just because it does not live up to everyone's current and future expectations?
> It may still be useful to someone else, you know, it does not break anything
> else, and can be improved in the future. Once we decide what we want
> to achieve.
>
> What hurts me now is that no one is going to use my code, since I
> don't have the time to make it perfect, and it is hidden in my
> experimental Bioruby branches. We should find a way to make
> 'experimental code' available to the rest of the community. That way
> we may also 'recruit' help to make the code more perfect.

I agree 100% that enthusiastic bioruby improvements like Pjotr's should be encouraged & given maximal visibility. It's better to have great tools with room for improvement than no tools. (A year or two ago I needed colored html alignments and ended up with an ugly, ugly hack that used t_coffee to generate html output from the alignments I'd generated elsewhere - something like Pjotr's code would have been much more elegant.)

I also have the feeling that code contributions in general are given more negative than positive feedback on this list. I believe it's a grave mistake, because the bioruby community will not grow without passionate users & contributors and more quality code.

just my two cents,

yannick

--------------------------------------------
yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
http://www.unil.ch/dee/page28685_fr.html

From pjotr.public14 at thebird.nl Fri Jan 22 08:55:08 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 09:55:08 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To:
References: <20100121162049.GB31462@thebird.nl>
Message-ID: <20100122085508.GB12248@thebird.nl>

On Fri, Jan 22, 2010 at 02:54:24AM +0900, Toshiaki Katayama wrote:
> Dear Pj,
>
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.

Thinking about it there are other good examples.
The R language supports modules in CRAN - similar in many ways to the generic Perl CPAN and Ruby's gems. But on top of CRAN they also have Bioconductor, which aggregates Bio-related modules. The main benefit is that it pre-packages all Bio-related packages and people can load them on the fly. See http://www.bioconductor.org/

We don't want to replace gems - but I think the gem system is too loose for most people, and it requires every module to understand and comply with the gem system.

I think Bioruby can play a role here. We can have modules (or plugins, like Rails has) that either come with Bioruby's installation or get installed on request. If we find a syntax for that it would be great. E.g.

  Bio::Module.load(:html_alignment)

If it is part of Bioruby, pass. Otherwise throw an error:

  "Bio::Module :html_alignment not installed, try Bio::Module.install(:html_alignment)"

  Bio::Module.install(:html_alignment)

will search for the definition and install it. Depending on the module it can be installed as a gem, or fetched through git or a tarball (an optional parameter can overrule this behaviour). On success, either function leaves you ready to write:

  html_aln = Bio::Html::Alignment.new('my.aln')

The nice thing about this setup is that it

(1) Is really easy on the user
(2) Decouples the module from Bioruby - all issues are between the users and the module maintainer - discussions can still be on the main mailing list
(3) Retains some control over which modules are allowed in, and which are not
(4) Allows modules to be obsoleted
(5) Allows modules to be updated outside Bioruby's mainline, e.g.

  Bio::Module.install(:html_alignment,:development=>true)

Pj.
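As a rough illustration of the load step proposed above, here is a minimal sketch. It is not BioRuby code: BioSketch, ModuleRegistry, ModuleNotInstalled and the KNOWN table are hypothetical names, and the registry is a plain hash because the proposal leaves the storage open. Only the wording of the error message and the htmlalignment path are taken from the thread.

```ruby
# Hypothetical sketch of the proposed "load or tell me how to install" step.
# None of these names exist in BioRuby.
module BioSketch
  class ModuleNotInstalled < StandardError; end

  module ModuleRegistry
    # Module name => library path handed to `require`.
    # Assumption: a real registry might live in a YAML file instead.
    KNOWN = {
      :html_alignment => 'bio/output/html/htmlalignment'
    }

    # Requires the library behind a registered module name.
    # `requirer` is injectable so the behaviour can be tested without
    # touching the real load path; by default it calls Kernel#require.
    def self.load(name, requirer = lambda { |path| require path })
      path = KNOWN.fetch(name) do
        raise ArgumentError, "unknown module #{name.inspect}"
      end
      begin
        requirer.call(path)
      rescue LoadError
        # Turn the low-level LoadError into the hint from the proposal.
        raise ModuleNotInstalled,
              "Bio::Module #{name.inspect} not installed, " \
              "try Bio::Module.install(#{name.inspect})"
      end
      true
    end
  end
end
```

Calling BioSketch::ModuleRegistry.load(:html_alignment) on a system without that file raises ModuleNotInstalled with the install hint. The install step itself (gem, git or tarball) is deliberately left out, since the thread does not settle how it would work.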
From tomoakin at kenroku.kanazawa-u.ac.jp Fri Jan 22 09:12:29 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:12:29 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To:
References: <20100121162049.GB31462@thebird.nl>
Message-ID: <066BB141-7217-4343-85B4-165072A58E06@kenroku.kanazawa-u.ac.jp>

Hi,

> As for the HTML portion, I see your point.
>
> * I'd like to hear comments from others.
> * How people like to render/visualize the BioRuby objects
> (especially in HTML)?
> * I didn't mean to use the CGI class for HTML generation (I even
> don't like that).

Perhaps the way to render objects depends on both the objects and the purpose, but if an object has a string representation, just showing it is probably a good default. Also, comprehensively defining how to represent every class in HTML or any other format is too laborious as a first step, and a way that allows gradual growth of the codebase seems good. That is how the flatfile parser grew to support many formats. Thus, a mechanism that does class-specific conversion, with a default conversion for classes that are not HTML-aware, would be good.

Criticism of using the 'cgi' library for the default conversion, CGI.escapeHTML(object.to_s), especially of the name, is understandable. There is already criticism of CGI.rb itself, but there are no *standard* alternatives yet. Perhaps we can just copy or rewrite the escapeHTML code and give it any name that fits our purpose. A drawback of having our own escapeHTML code is that it could be redundant in the many cases where the HTML is generated for CGI, and we could not benefit from CGIAlt or any other compatible speedup library for CGI, whether a rewrite or a C extension. But I think this is not a very large problem.

Having require 'bio' automatically load cgi.rb is undesirable. If the HTML code is not loaded automatically by require 'bio' but only by a separate require 'bio/html', then I feel that 'bio/html' loading cgi.rb is within a reasonable range.
The capability to use styles instead of directly specifying color and font is desirable, since it could reduce the output size and possibly improve readability. Nonetheless, this is not mandatory, and a first implementation with direct specification is OK.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From jan.aerts at gmail.com Fri Jan 22 09:34:43 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 22 Jan 2010 09:34:43 +0000
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To:
References:
Message-ID: <4c7507a71001220134j3eecf626y90755ddd919336e4@mail.gmail.com>

Hear, hear... Exactly my feelings as well.

j.

2010/1/21 Yannick Wurm

> I agree 100% that enthusiastic bioruby improvements like Pjotr's should be
> encouraged & given maximal visibility.
> It's better to have great tools with room for improvement than no tools.
>
> I also have the feeling that code contributions in general are given more
> negative than positive feedback on this list. I believe it's a grave mistake
> because the bioruby community will not grow without passionate users &
> contributors and more quality code.

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Jan 22 09:48:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:48:20 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100122085508.GB12248@thebird.nl>
References: <20100121162049.GB31462@thebird.nl> <20100122085508.GB12248@thebird.nl>
Message-ID: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>

Hi,

> Bio::Module.load(:html_alignment)

What is the benefit over

  require 'bio/html_alignment' # no autoload by require 'bio'

?

> Bio::Module.install(:html_alignment)
>
> will search the definition and install it.

I feel installation is easier from the shell, like:

  $ ruby bioruby-inst-module html_alignment

but calling Module.install internally is fine.

> (5) Modules can be updated outside Bioruby's mainline. e.g.
> Bio::Module.install(:html_alignment,:development=>true)

We need a mechanism to check version compatibility between the standard bioruby and the modules, especially when the mainline bioruby is updated. Different modules will probably depend on the bioruby code to different degrees, and an update to the main bioruby code may sometimes break an old module.
-- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From pjotr.public14 at thebird.nl Fri Jan 22 10:49:00 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 22 Jan 2010 11:49:00 +0100 Subject: [BioRuby] Proposal: Bioruby modules (the bazaar) In-Reply-To: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp> References: <20100121162049.GB31462@thebird.nl> <20100122085508.GB12248@thebird.nl> <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp> Message-ID: <20100122104900.GB15628@thebird.nl> On Fri, Jan 22, 2010 at 06:48:20PM +0900, Tomoaki NISHIYAMA wrote: >> Bio::Module.load(:html_alignment) > > What is the benefit over > require 'bio/html_alignment' # no autoload by require 'bio' > ? A method allows more checking. I presume the module information will be somewhere in a YAML file in the main tree. Or maintained through git submodules. >> Bio::Module.install(:html_alignment) >> >> will search the definition and install it. > > I feel installation is easier from shell like: > $ ruby bioruby-inst-module html_alignment > but calling the Module.install internally is fine. My example is for an interactive session. You only do it once (I hope). Or when an author says he has updated his module. >> (5) Modules can be updated outside Bioruby's mainline. e.g. >> Bio::Module.install(:html_alignment,:development=>true) > > We need to have a mechanism to check the versions between > the standard bioruby and the modules. Especially when the > mainline bioruby is updated. Different modules perhaps will > have different level of dependency on the bioruby code, and > update in the main bioruby code sometimes may break the old > module. Well. Bioruby should not care. I think you misunderstand the purpose. Modules are *not* to be supported from Bioruby. It is only a mechanism to make them easily available. If things break, they break. That is why it is developmental, or experimental. 
The modules that are well 'supported' will come inside the distribution. Outside modules are up to the module maintainer. Besides, you don't want to replace gems. If an author wants versioning he can provide a gem (which, again, can be loaded as a Bioruby module). Once a module goes main stream versioning is moot. It just becomes part of the Bioruby tree. When everyone understands this a module can still support versioning. But I think that ought to be done through gems. Pj. From andrew.j.grimm at gmail.com Tue Jan 26 12:12:35 2010 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Tue, 26 Jan 2010 23:12:35 +1100 Subject: [BioRuby] Thread-safety of alignment In-Reply-To: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi Naohisa Goto, I tried creating a new factory in each thread, but I sometimes (but not always) have errors. Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb correct? Does it cause problems for anyone else? Some of the errors I get include the ones seen at http://gist.github.com/286775 It's possible that the issues are caused by problems in tempfile itself (which may have been fixed in August 2009 according to the changelog). Thanks, Andrew On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO wrote: > Hi, > > On Wed, 20 Jan 2010 23:09:19 +1100 > Andrew Grimm wrote: > >> Is alignment intended to be thread-safe in bioruby? If so, should I >> use the same alignment factory between threads, or a separate one in >> each thread? > > It is not confirmed to be thread-safe, so it is safe to use > separate one in each thread. > > Currently, in BioRuby, manipulating the same object from different > threads is not intended. When manipulating the same object from > different threads is needed, using mutex is recommended. 
>
> For library developers, it is encouraged to write thread-safe
> code if possible, but not mandatory.
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Andrew
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>

From ngoto at gen-info.osaka-u.ac.jp Tue Jan 26 15:00:04 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 27 Jan 2010 00:00:04 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To:
References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>

Hi Andrew,

On Tue, 26 Jan 2010 23:12:35 +1100
Andrew Grimm wrote:

> Hi Naohisa Goto,
>
> I tried creating a new factory in each thread, but I sometimes (but
> not always) have errors.

Please show your Ruby version and BioRuby version.

  % ruby -v
  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
  (If you are using BioRuby 1.2.1 or earlier,
  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
  )

> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
> correct? Does it cause problems for anyone else?

The "rescue RuntimeError" in line 15 may hide problems. In my environment, it seems that the RuntimeError is raised in lib/bio/alignment.rb. The error message I observed without the rescue was "alignment result is inconsistent with input data", and the output file created by Clustalw was unexpectedly empty. It might be a bug in Ruby's Tempfile, but I'm not sure.

With Ruby 1.8.7, errors are sometimes observed.

  % ruby -v
  ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
  ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
  ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

With Ruby 1.9.1-p378, no errors occurred when I ran it several times.
 % ruby -v
 ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]

> Some of the errors I get include the ones seen at http://gist.github.com/286775

The message "ERROR: Multiple sequences found with same name
(found 0 at least twice)!" is reported by ClustalW, and
it indicates incorrect sequence names in the input file. Maybe
the contents of two files are unexpectedly concatenated or mixed,
possibly due to a bug in Tempfile, but I am not sure.

> It's possible that the issues are caused by problems in tempfile
> itself (which may have been fixed in August 2009 according to the
> changelog).

Another possibility is resource limits of the machine:
the number of child processes, total memory size, etc.
If the limits are exceeded, a new child clustalw process could
not be started, or running clustalw processes might be
killed. This also causes empty or truncated result files,
and leads to ruby-level errors.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

>
> Thanks,
>
> Andrew
>
> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
> wrote:
> > Hi,
> >
> > On Wed, 20 Jan 2010 23:09:19 +1100
> > Andrew Grimm wrote:
> >
> >> Is alignment intended to be thread-safe in bioruby? If so, should I
> >> use the same alignment factory between threads, or a separate one in
> >> each thread?
> >
> > It is not confirmed to be thread-safe, so it is safe to use
> > separate one in each thread.
> >
> > Currently, in BioRuby, manipulating the same object from different
> > threads is not intended. When manipulating the same object from
> > different threads is needed, using mutex is recommended.
> >
> > For library developers, it is encouraged to write thread-safe
> > code if possible, but not mandatory.
> >
> > Naohisa Goto
> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >
> >>
> >> Andrew
> >> _______________________________________________
> >> BioRuby Project - http://www.bioruby.org/
> >> BioRuby mailing list
> >> BioRuby at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> >

From andrew.j.grimm at gmail.com Wed Jan 27 03:07:18 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 27 Jan 2010 14:07:18 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi Naohisa Goto,

On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO wrote:
> Hi Andrew,
>
> On Tue, 26 Jan 2010 23:12:35 +1100
> Andrew Grimm wrote:
>
>> Hi Naohisa Goto,
>>
>> I tried creating a new factory in each thread, but I sometimes (but
>> not always) have errors.
>
> Please show ruby version and BioRuby version.
>  % ruby -v
>  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
> (If you are using BioRuby 1.2.1 or earlier,
>  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
> )
>

I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0.

>> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
>> correct? Does it cause problems for anyone else?
>
> The "rescue RuntimeError" in line 15 may hide problems.
> In my environment, it seems that the RuntimeError is raised
> in lib/bio/alignment.rb. The error message I observed
> without the rescue was
> "alignment result is inconsistent with input data",
> and output file created by Clustalw was unexpectedly empty.
> It might be a bug of Tempfile in Ruby, but not sure.
>
> With Ruby 1.8.7, errors are observed in some times.
>  % ruby -v
>  ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
>  ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
>  ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
>
> With Ruby 1.9.1-p378, no errors when I executed several times.
>  % ruby -v
>  ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
>

I suspect errors may occur on earlier versions of ruby 1.9.1.

>> Some of the errors I get include the ones seen at http://gist.github.com/286775
>
> The message "ERROR: Multiple sequences found with same name
> (found 0 at least twice)!" is reported by ClustalW, and
> it indicates incorrect input file sequence names. Maybe
> two file contents are unexpectedly concatenated or mixed
> possibly due to a bug of Tempfile, but not sure.
>
>> It's possible that the issues are caused by problems in tempfile
>> itself (which may have been fixed in August 2009 according to the
>> changelog).
>
> Another possibility is resource limits of the machine:
> the number of child processes, total memory size, etc.
> If exceeding limits, new child clustalw process could
> not be started, or running clustalw processes might be
> killed. This also causes void or truncated result files,
> and leads to ruby-level errors.
>

Thanks for that suggestion. I re-ran the test using only 5 threads in
the new gist http://gist.github.com/287499

> Thanks,
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Thanks,
>>
>> Andrew
>>
>> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
>> wrote:
>> > Hi,
>> >
>> > On Wed, 20 Jan 2010 23:09:19 +1100
>> > Andrew Grimm wrote:
>> >
>> >> Is alignment intended to be thread-safe in bioruby? If so, should I
>> >> use the same alignment factory between threads, or a separate one in
>> >> each thread?
>> >
>> > It is not confirmed to be thread-safe, so it is safe to use
>> > separate one in each thread.
>> >
>> > Currently, in BioRuby, manipulating the same object from different
>> > threads is not intended.
>> > When manipulating the same object from
>> > different threads is needed, using mutex is recommended.
>> >
>> > For library developers, it is encouraged to write thread-safe
>> > code if possible, but not mandatory.
>> >
>> > Naohisa Goto
>> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>> >
>> >>
>> >> Andrew
>> >> _______________________________________________
>> >> BioRuby Project - http://www.bioruby.org/
>> >> BioRuby mailing list
>> >> BioRuby at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioruby
>> >
>> >
>
>

From missy at be.to Fri Jan 29 06:46:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 15:46:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
Message-ID: <4B628437.30305@be.to>

Hi all,

How about implementing the following methods?

Bio::FastaFormat#each_entry
Bio::FastaNumericFormat#each_entry

The following is sample code to generate a FASTQ string from a FASTA
string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.

I am afraid that simpler or easier ways already exist in BioRuby...

Hiro.

-----
#!/usr/local/bin/ruby

require 'rubygems'
require 'bio'

module Bio
  class FastaFormat
    def each_entry
      return to_enum(:each_entry) unless block_given?
      @continue = self.dup
      loop do
        yield @continue
        overrun = @continue.entry_overrun
        break unless overrun
        @continue = Bio::FastaFormat.new(overrun)
      end
    end
  end

  class FastaNumericFormat
    def each_entry
      return to_enum(:each_entry) unless block_given?
      @continue = self.dup
      loop do
        yield @continue
        overrun = @continue.entry_overrun
        break unless overrun
        @continue = Bio::FastaNumericFormat.new(overrun)
      end
    end
  end
end

fasta = <<EOS
>FXQB1I00000001
TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
>FXQB1I00000002
CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
EOS

qual = <<EOS
>FXQB1I00000001
30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30
42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
>FXQB1I00000002
30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34
22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30
27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
EOS

enum_fasta = Bio::FastaFormat.new(fasta).each_entry
enum_qual = Bio::FastaNumericFormat.new(qual).each_entry

loop do
  fastq = Bio::Sequence.adapter(enum_fasta.next,
                                Bio::Sequence::Adapter::Fastq)
  fastq.quality_score_type = :phred
  fastq.quality_scores = enum_qual.next.data
  puts fastq.output(:fastq)
end

--
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From ngoto at gen-info.osaka-u.ac.jp Fri Jan 29 10:25:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 29 Jan 2010 19:25:29 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <4B628437.30305@be.to>
References: <4B628437.30305@be.to>
Message-ID: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" wrote:

> Hi all,
>
> How about implementing the following methods?
>
> Bio::FastaFormat#each_entry
> Bio::FastaNumericFormat#each_entry
>
> The following is a sample code to generate a FASTQ string from a FASTA
> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways are already existed in BioRuby...
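
For readers unfamiliar with the idiom in the quoted proposal, here is a
minimal sketch in plain Ruby (no BioRuby required) of how
"return to_enum(:each_entry) unless block_given?" gives an external
enumerator, so that two streams can be stepped through in parallel with
"next". The RecordList class and its sample data are hypothetical and
exist only for illustration; they are not part of BioRuby.

```ruby
# Hypothetical stand-in for a multi-entry container (not a BioRuby class).
class RecordList
  def initialize(records)
    @records = records
  end

  # With a block: yield each record in turn.
  # Without a block: return an Enumerator over this same method.
  def each_entry
    return to_enum(:each_entry) unless block_given?
    @records.each { |record| yield record }
  end
end

names  = RecordList.new(%w[seq1 seq2]).each_entry
scores = RecordList.new([[30, 29], [33, 19]]).each_entry

paired = []
# Kernel#loop rescues StopIteration, so the loop ends quietly
# as soon as either enumerator runs out of records.
loop do
  paired << [names.next, scores.next]
end

paired.each { |name, quality| puts "#{name}: #{quality.join(' ')}" }
```

Enumerator#next is presumably the reason the proposal notes that it may
need ruby 1.8.7 or later.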

I think mixing a single-entry parser with a multiple-entry iterator
will cause confusion, and is not a good way.

For most parser classes in BioRuby, the expected data source is a
String containing single-entry data. In addition, for IO with
possibly multiple entries, Bio::FlatFile is the front-end that
can detect the data type, split each entry, and call the assigned
parser class.

For a String containing multiple entries, using StringIO and
then Bio::FlatFile is the easiest way, although indirect.
Recently, many efficient memory-mapped data transfer methods
are available, e.g. memcached, IPC shared memory, mmap(2)
system call. I'm now thinking about how to treat such data efficiently.

Below is an example using StringIO and Bio::FlatFile.

#------------------------------------------------
require 'stringio'
require 'bio'

# When copying and pasting this script, the "> " at the head of
# each line should be removed.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
>
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS

ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
ff_qual = Bio::FlatFile.open(StringIO.new(qual))

while entry_fasta = ff_fasta.next_entry
  seq = entry_fasta.to_biosequence
  seq.quality_score_type = :phred
  seq.quality_scores = ff_qual.next_entry.data
  puts seq.output(:fastq, :title => entry_fasta.definition)
end
#------------------------------------------------

> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
>
> loop do
>   fastq =
>     Bio::Sequence.adapter(enum_fasta.next,
>                           Bio::Sequence::Adapter::Fastq)
>   fastq.quality_score_type = :phred
>   fastq.quality_scores = enum_qual.next.data
>   puts fastq.output(:fastq)
> end

Bio::Sequence.adapter is for BioRuby library internal use only, and
normally should not be used by user scripts. In addition, using
Adapter::Fastq for Bio::FastaFormat data is a mismatch. In this case,
use Bio::FastaFormat#to_biosequence.

>
> --
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From missy at be.to Fri Jan 29 11:24:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 20:24:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
References: <4B628437.30305@be.to> <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <4B62C55F.1050506@be.to>

Hi, Naohisa GOTO,

Thank you so much for the detailed explanation and the sample code. It
was a big help for me in understanding BioRuby's overall design.

Although I used here-documents in my code, what I wanted to do was
just make a FASTQ file from regular FASTA and FASTA.QUAL files.

I tried your code using my relatively large input files. It was much
faster than my code. The final code is simply the following:

----
require 'bio'

ff_fasta = Bio::FlatFile.open(ARGV[0])
ff_qual = Bio::FlatFile.open(ARGV[0]+".qual")

while entry_fasta = ff_fasta.next_entry
  seq = entry_fasta.to_biosequence
  seq.quality_score_type = :phred
  seq.quality_scores = ff_qual.next_entry.data
  puts seq.output(:fastq, :title => entry_fasta.definition)
end
----

Hiro.

Naohisa GOTO wrote (2010/01/29 19:25):
> Hi,
>
> On Fri, 29 Jan 2010 15:46:15 +0900
> "MISHIMA, Hiroyuki" wrote:
>
>> Hi all,
>>
>> How about implementing the following methods?
>>
>> Bio::FastaFormat#each_entry
>> Bio::FastaNumericFormat#each_entry
>>
>> The following is a sample code to generate a FASTQ string from a FASTA
>> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>>
>> I am afraid that simpler or easier ways are already existed in BioRuby...
>
> I think mixing single entry parser with multiple entry iterator
> will cause confusion, and not good way.
>
> For most parser classes in bioruby, expected data source is
> String containing single entry data. In addition, for IO with
> possible multiple entries, Bio::FlatFile is the front-end that
> can detect data type, splits each entry, and calling assigned
> parser class.
>
> For String containing multiple entries, using StringIO and
> then Bio::FlatFile is the easiest way, although indirect.
> Recently, many efficient memory-mapped data transfer methods
> are available, e.g. memcached, IPC shared memory, mmap(2)
> system call. I'm now thinking how to treat such data efficiently.

--
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From biopython at maubp.freeserve.co.uk Fri Jan 29 10:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [BioRuby] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010
In-Reply-To:
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one
mailing list? Is there a suitable BOSC list, or how about the
general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results.  This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on...  Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his
ties to Biopython (plus it would be a shame if we'd be one man
down for trying to solve the challenges - grin). Having a project
representative "sign off" on the challenge might work - or simply
the whole of the BOSC committee, which is quite balanced.

Alternatively, some kind of panel of challenges does seem a good
way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in
mind - would something like taking a newly sequenced bacterium and
producing an automated annotation as a GenBank, EMBL, or GFF
file be too ambitious, for example? There are already several major
projects to do this, e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)