From ngoto at gen-info.osaka-u.ac.jp  Mon Jan  4 02:15:18 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 4 Jan 2010 16:15:18 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20091231141546.GA5770@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
Message-ID: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I also think the current Bio::PAML::Codeml::Report needs to be
rewritten. It would be great if you did so. Here are my comments.

>  codeml = Bio::PAML::Codeml.new(nil, :runmode => 0, :RateAncestor => 1,
>                                      :alpha => 0.5, :fix_alpha => 0)
>  report = codeml.query(alignment, tree)
>
> which, as it happens, works. The 'nil' points to the program executable.
> 'nil' merely fills in 'codeml'. It would have been better to make it one
> of the listed options, e.g. :binary => 'codeml'. That would save the ugly
> 'nil' parameter and belongs more to the principle of least surprise, that
> makes Ruby shine.

It is safer not to merge BioRuby internal options with PAML's options.
If the upstream authors of PAML introduced a new option named binary,
a severe problem would occur.

One way is to write code that acts something like C++ polymorphism.
For example, the code below accepts the following three cases.
* Bio::PAML::Codeml.new("/path/to/codeml")
* Bio::PAML::Codeml.new({ :xxx => yyy, :ppp => qqq })
* Bio::PAML::Codeml.new("/path/to/codeml", { :xxx => yyy, :ppp => qqq })

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0, 1
      # A Hash (or anything responding to to_hash) is taken as params;
      # anything else, including nil, is taken as the program path.
      begin
        params = argv[0].to_hash
      rescue NoMethodError
        program = argv[0]
      end
    when 2
      program, params = *argv
    else
      raise ArgumentError, "wrong number of arguments (#{argv.size} for 2)"
    end
    # continues to the current code...

The drawbacks are:
* The code becomes more complex.
* It might make the code difficult to refactor, especially when keyword
   arguments are introduced in a future version of Ruby.
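
The argument handling above can be tried out in isolation with a small
sketch like the following. The class name FlexibleArgs and the fallback
program name 'codeml' are placeholders chosen for illustration, not
part of BioRuby; respond_to? is used here instead of rescuing
NoMethodError, which is a swapped-in variation of the same idea.

```ruby
# Hypothetical stand-in class; only the argument handling is of interest.
class FlexibleArgs
  attr_reader :program, :params

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0
      # nothing given: keep the defaults
    when 1
      if argv[0].respond_to?(:to_hash)
        params = argv[0].to_hash     # a Hash of options
      else
        program = argv[0]            # a program path (or nil)
      end
    when 2
      program, params = *argv
    else
      raise ArgumentError, "wrong number of arguments (#{argv.size} for 0..2)"
    end
    @program = program || 'codeml'   # assumed default program name
    @params = params
  end
end

a = FlexibleArgs.new("/path/to/codeml")
b = FlexibleArgs.new({ :runmode => 0 })
c = FlexibleArgs.new("/path/to/codeml", { :runmode => 0 })
puts [a.program, b.program, c.params.inspect].join(" | ")
```

The existing call style with a leading nil, as in
FlexibleArgs.new(nil, :runmode => 0), still goes through the
two-argument branch and falls back to the assumed default program.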

Note that Ruby's author Matz has said that he had not applied the
principle of least surprise to the design of Ruby.
(http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
Please be aware that the phrase "principle of least surprise (POLS)"
is an NG word (one to avoid) when you request something in Ruby.
(http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

>  A new implementation of Bio::PAML::Codeml::Report

> So I propose to rewrite the class supporting for multiple models,
> with the following usage (starting from a codeml report - really result):
>
> >> report.models.size
> => 2
> >> report.models[0].name
> => "M0"

I suppose report.models returns a Hash containing objects of a newly
written class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
That seems good.

Existing methods could be changed to return the first model's values.
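
A rough sketch of that idea, with one object per model and an old
single-value method answering from the first model. The names
ModelSketch and ReportSketch, the fields, and the numbers are invented
for illustration and are not taken from the actual parser; a plain
Array is used for the container because the quoted usage indexes with
[0].

```ruby
# Hypothetical sketch; field names and values are illustrative only.
ModelSketch = Struct.new(:name, :lnL, :kappa)

class ReportSketch
  attr_reader :models

  def initialize(models)
    @models = models
  end

  # Old single-model interface: answer with the first model's value.
  def lnL
    @models.first.lnL
  end
end

report = ReportSketch.new([ModelSketch.new("M0", -1234.5, 2.0),
                           ModelSketch.new("M3", -1200.0, 1.9)])
puts report.models.size
puts report.models[0].name
puts report.lnL
```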

> Unit tests

Currently, tests with external dependencies (e.g. web services) are
located in the test/functional/ directory. So, your tests running
codeml would be named test/functional/bio/appl/paml/test_codeml.rb,
test/functional/bio/appl/paml/codeml/test_report.rb, or something
like this.

> These tests, for example, can be run on a special switch:
>
>  runner.rb --test-dependencies

I'm now looking for ways to pass such parameters to tests.
Note that tests can also be run in various ways. For example,
  ruby test/unit/bio/appl/paml/codeml/test_report.rb 
  testrb test/unit/bio/appl/paml/codeml
  rake test

> I am sure it works, but doesn't anyone think this belongs in a support
> module (e.g. BioTestFile) for testing? What I would like to see is
> something less brittle:
>
>  require 'bio/test'
>  str = BioTestFile::read('paml/codeml/output.txt')

I'd like to keep tests simple and clear, and I think using the standard
File.read is enough and clearer. With such a special class, one has to
read an extra file to understand the behavior of the test code.

> Personally, I dislike the naming/name space scheme of Bioruby.
> What to think of invoking a class named
>
>  report = Bio::PAML::Codeml::Report.new

Because there are many bioinformatics software packages and databases,
names tend to be longer, and namespace nesting tends to be deeper.
I'd like to know the naming rules and policies of other open-bio projects.

> Why can't it just be
>
>  include Bio
>  report = Codeml.new

I think it is enough to write "include Bio::PAML" instead of (or in
addition to) "include Bio".
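
To illustrate the mechanism only, here is a sketch using stand-in
modules that mimic the nesting. BioSketch and its contents are
hypothetical and are not the real BioRuby classes.

```ruby
# Stand-in modules mimicking the nesting; not the real BioRuby classes.
module BioSketch
  module PAML
    class Codeml
      def name
        "codeml"
      end
    end
  end
end

# The fully qualified name always works.
puts BioSketch::PAML::Codeml.new.name

# Including the inner module makes the short name resolvable as well.
include BioSketch::PAML
puts Codeml.new.name
```

Including only BioSketch would make PAML::Codeml available, but not
the bare Codeml.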

>  include Bio
>  result = Paml.new(:program => 'codeml')

I don't like introducing a new parameter such as :program.
I think one class per binary is better.

In addition, because the differences among the PAML tools (codeml, baseml,
yn00, etc.) are currently not small, merging the classes is not
realistic now.

On Thu, 31 Dec 2009 15:15:46 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> Hi Michael,
> 
> I have a writeup on improving the current PAML functionality. Are you
> OK with this?
> 
>   http://bioruby.open-bio.org/wiki/BIORUBY_PAML
> 
> (maybe it does not belong on the bioruby Wiki - but I think of it
> like a 'design' document).
> 
> Pj.
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl  Mon Jan  4 04:03:18 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 10:03:18 +0100
Subject: [BioRuby] Bioruby design
Message-ID: <20100104090318.GA16136@thebird.nl>

Thanks for the reply Naohisa. As we are moving on to design, rather
than one implementation I am changing the thread.

On Mon, Jan 04, 2010 at 04:15:18PM +0900, Naohisa GOTO wrote:
> It is safe not to merge bioruby internal options and PAML's options.
> If the upstream authors of PAML introduced a new option named binary,
> severe problem would occur.

I am against breaking interfaces. This is a minor design problem
which should be avoided in the future. And, yes, I would certainly
not favour a polymorphism solution, unless unavoidable. 

I don't think it is worth 'fixing' this interface aspect at this stage. 

Perhaps, there will be opportunities later.

> Note that Ruby's author Matz has said that he had not applied the
> principle of least surprise to the design of Ruby.
> (http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
> Please be careful that the word "principle of least surprise (POLS)"
> is NG word when you request something in Ruby.
> (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

I did not know that, and personally I do not care. I think POLS is a
really good idea, though it should not automatically come at the
expense of (for example) convenience, or performance. I favour easy
API's, and that is where the principle of least surprise comes in. It
means to me that I don't have to fetch the manuals every time (like I
do with Perl). So, let's not throw away the baby with the bath water.

I like POLS, as much as I like KISS.

> > >> report.models[0].name
> > => "M0"
> 
> I suppose report.models returns a Hash containing objects of newly written
> class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
> It seems good.

In fact, I have made it an array. See my PAML branch.

> >  runner.rb --test-dependencies
> 
> I'm now searching ways to pass such parameters to tests.

In the runner you can parse the parameters first and pull them off
the stack. I did something like that for cfruby:

  http://cfruby.rubyforge.org/git?p=cfruby.git;a=blob;f=test/runner.rb;h=c202e48783a744c4cb3e339e2b891b3eab354c3e;hb=HEAD
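
A minimal sketch of that approach, assuming the --test-dependencies
switch proposed earlier. The helper name extract_switch! is made up
here, and this is not the cfruby code; the point is only that the
runner removes its own switch before the test framework sees the
remaining arguments.

```ruby
# Hypothetical runner fragment; the switch name follows the proposal above.
# Removes +switch+ from +args+ and reports whether it was present.
def extract_switch!(args, switch)
  # Array#delete returns the removed item, or nil when it was absent.
  !args.delete(switch).nil?
end

args = ["--test-dependencies", "--verbose"]   # stands in for ARGV
run_dependent_tests = extract_switch!(args, "--test-dependencies")

puts "arguments left for the test framework: #{args.inspect}"
puts "run tests that need external programs: #{run_dependent_tests}"
```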

 
> I'd like to keep tests simple and clear, and I think using standard
> File.read is enough and clearer. When using such special class, to know
> the behavior of the test code, reading extra file is needed.

I disagree, but that is obvious. 

> > Personally, I dislike the naming/name space scheme of Bioruby.
> > What to think of invoking a class named
> >
> >  report = Bio::PAML::Codeml::Report.new
> 
> Because there are many bioinformatics software and databases, names
> tends to be longer, and nesting of namespace tends to be deeper.
> I'd like to know naming rules and policies of other open-bio projects.

I think we should not mirror ourselves on these. We can do better.
RoR is a much better example to mirror ourselves on.

> > Why can't it just be
> >
> >  include Bio
> >  report = Codeml.new
> 
> I think it is enough to write "include Bio::PAML" instead of (or in
> addition to) "include Bio".

Not really. It brings in another source of errors for users if they
have to think about that context every time. We will get all
variants, like Bio::Kegg, Bio::Sequence etc.

I think name spaces are there to *avoid* conflict. If a naming scheme
precludes conflict, why bring in another layer?

I want Bioruby to be as easy as possible, and with the least
amount of typing. More text = harder to read.

> >  include Bio
> >  result = Paml.new(:program => 'codeml')
> 
> I don't like introducing such new parameter like :program.
> I think 1 class 1 binary is better.

I agree. It was just another option.

> In addition, because the differences within PAML tools (codeml, baseml,
> yn00, etc.) are currently not small, merging the classes is not so
> realistic now.

We have to separate our own conveniences from design choices.

Meanwhile I do agree we should not change the current interfaces. We
can create a new version of Bioruby with both old and new interfaces
supported. That is one thing I propose.

I am putting together a discussion document on the future of Bioruby
(design choices). We will have opportunity to discuss that in Japan.
We can consider raising a community vote once we have a list of
options.

Pj.


From pjotr.public14 at thebird.nl  Mon Jan  4 06:51:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 12:51:05 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100104115105.GA21035@thebird.nl>

I have updated the writeup at

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

Have a look at my PAML branch. The (old) unit tests pass.

  http://github.com/pjotrp/bioruby/tree/PAML

I have to add the positive selection sites, to complete it.

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Mon Jan  4 07:33:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 4 Jan 2010 21:33:20 +0900
Subject: [BioRuby] Bioruby design
In-Reply-To: <20100104090318.GA16136@thebird.nl>
References: <20100104090318.GA16136@thebird.nl>
Message-ID: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>

Hi,

 > As people tend not to think of Paml as a toolbox I would prefer
 > to have one object named Paml. With behind it the codeml 'engine'
 > and reporter. This would work for me (also note Paml does
 > not return a report, but rather a result):

I don't agree on this point. PHYLIP is clearly a package or collection
of programs, and so are Molphy, PAML, ...

 > result = Paml.new(:program => 'codeml')

And if you make a single object, it is not obvious how to divide it
based on the program, since aaml is now done by codeml but should
clearly be considered a different function.

>>>  include Bio
>>>  report = Codeml.new
>>>
>>
>> I think it is enough to write "include Bio::PAML" instead of (or in
>> addition to) "include Bio".
>>
>
> Not really. It brings in another source of errors for users if they
> have to think about that context every time. We will get all
> variants, like Bio::Kegg, Bio::Sequence etc.


These are short enough, since we have to write something like
"PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
in the manuscript of the paper if we use that module.
Stating their use explicitly in the first lines of the
program is considered good.

On the other hand, I don't like "include Bio::Sequence", since it is
a function of BioRuby itself.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Mon Jan  4 10:04:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 16:04:59 +0100
Subject: [BioRuby] Bioruby design
In-Reply-To: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
References: <20100104090318.GA16136@thebird.nl>
	<14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100104150459.GB21412@thebird.nl>

On Mon, Jan 04, 2010 at 09:33:20PM +0900, Tomoaki NISHIYAMA wrote:
> These are short enough, since we have to write something like
> "PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
> in the manuscript of the paper if we use that module.
> Stating their use explicitly in the first lines of the
> program is considered good.

Uhm. I think that is a bit far fetched. The way you propose it is
that you would have to load the name space every time you use
something in code:

  require 'bio'

  include Bio::PAML
  include Bio::Kegg
  include ...
  
  do something

next source file, the same. And again:

  require 'bio'

  include Bio::PAML
  include Bio::Kegg
  include ...
  
  do something

This is the philosophy of Python - where every source file explicitly
loads all modules/name spaces.

It is arguably 'clear'. But ugly. And, takes the fun out of
programming (anyone mention that?).

Only once I have used the Python name spacing with good effect. It was
when we plugged in a replacement module - completely rewritten. That
was changing one line only - and it worked :-). In Python you can say

  import Paml as paml

it became

  import Paml2 as paml

That was nice. But when you see Python source files, the header is
ugly, and wastes a lot of typing. See for example:

  http://pypi.python.org/pypi/zope.sqlalchemy#example

I argue not to state imports. import Bio should be part of 

  require 'bio'

Anyway, we will have time to talk in Tokyo, I hope. 

Pj.


P.S. Do you have an example of anyone quoting a Bioruby module in a
paper?


From pjotr.public14 at thebird.nl  Mon Jan  4 12:09:04 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 18:09:04 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104115105.GA21035@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
Message-ID: <20100104170904.GA26187@thebird.nl>

The writeup is pretty much done, as well as the implementation.

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

All unit tests pass:

  Running tests for PAML
  Loaded suite .
  Started
  ....................
  Finished in 0.398394 seconds.
  20 tests, 37 assertions, 0 failures, 0 errors

It is compatible with the old version. I have added 41 assertions
in the doctest (the header of report.rb).

  === Testing 'mydoc.test'...
  1.   OK  | Default Test
  41 comparisons, 1 doctests, 0 failures, 0 errors

You can view the tests and implementation at

  http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb
See also 

The branch is:

  http://github.com/pjotrp/bioruby/tree/PAML

(don't you love github).

Pj.


From mail at michaelbarton.me.uk  Mon Jan  4 12:50:50 2010
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Mon, 4 Jan 2010 12:50:50 -0500
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> 
	<20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
Message-ID: <c27b73c1001040950h390b0d0ej201b59e1a071fe35@mail.gmail.com>

Hi Pjotr,

The expanded report.rb looks like an excellent and substantial
improvement to the previous version. You could add a deprecated tag
to the old interface methods and these could then be removed in a
later bioruby version to decrease clutter in the API.
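
One possible shape for such a deprecation, as a sketch only. The
class and the method names value and old_value are hypothetical and
are not the real Report methods; the old method stays in place,
prints a warning to stderr, and forwards to the new one.

```ruby
# Hypothetical class; shows only the deprecation pattern.
class DeprecationSketch
  # New interface.
  def value
    42
  end

  # Deprecated: kept for backward compatibility; use #value instead.
  def old_value
    warn "old_value is deprecated; use value instead"
    value
  end
end

puts DeprecationSketch.new.old_value
```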

Mike



From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 02:42:49 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 16:42:49 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
Message-ID: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

I'm reading the code (commit c2de9dd3ad055bab4bfb1d3e8da840493b110b0e).
It is generally good. Below are my comments and suggested changes.

>    # == Examples
>    #
>    # Read the codeml M0-M3 data file into a buffer
>    #
>    # >> require 'bio/test/biotestfile'
>    # >> buf = BioTestFile.read('paml/codeml/models/results0-3.txt')

It is not suitable to use such a nonstandard class in the example.
Users want to see example usage; they do not intend to run tests.
Note that I still disagree with the BioTestFile class.

>    class Report < Bio::PAML::Common::Report
> 
>      attr_reader :models, :header, :footer

RDoc documentation is also needed for attributes. To write RDoc,
the three attribute definitions need to be separated.
For example,

      # Models in the result
      # (Array containing Bio::PAML::Codeml::Model objects)
      attr_reader :models

      # ...(should be written)
      attr_reader :header

      # ...(should be written)
      attr_reader :footer

>      # Parse codeml output file passed with +buf+
>      def initialize buf

Details of +buf+ (class, contents, etc) should also be written in RDoc.
It is recommended to use the style written in the README_DEV.rdoc, or
the style used in the Ruby source code.
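
As a sketch of the kind of comment meant here, written as a plain RDoc
comment. The exact layout asked for in README_DEV.rdoc is not
reproduced, and the class name BufReportSketch is a placeholder.

```ruby
# Hypothetical class; shows only the documentation of +buf+.
class BufReportSketch
  # Creates a new report object from +buf+.
  #
  # +buf+ is a String holding the full text of a codeml output file
  # (the file contents, not the file name).
  def initialize(buf)
    @buf = buf
  end
end

BufReportSketch.new("codeml output text would go here")
```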

Please do not omit parentheses in the method definition lines.

>    # Model class
>    class Model 

Too little documentation. At least please write a note that it is
created by Bio::PAML::Codeml::Report.

>      def initialize buf

Please write in the RDoc that normal users do not call this method directly,
and that it is called internally by Bio::PAML::Codeml::Report objects.

Please do not omit parentheses in the method definition lines.

>      def lnL

RDoc documentation is needed here, and also for the omega, kappa, alpha,
tree_length, tree, and to_s methods.

>    class PositiveSite

Almost all methods have no RDoc documentation.

>      def to_a
>        [ @position, @aaref, @probability, @omega ]
>      end

What is the purpose of the method?

>    class PositiveSites < Array

Inheriting from Array to create a custom container class is discouraged.
In BioRuby, we deprecated Bio::Features and Bio::References in
version 1.3.0, although they do not inherit Array but hold an array
in the object. (The classes still exist only for backward compatibility,
in lib/bio/compat/features.rb and references.rb).

In this case, apart from initialize, only a method named "graph" is added.
I think it is better to add the graph method to the Report class and
use an Array for storing PositiveSite objects.
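
A sketch of this alternative, under stated assumptions: the names
GraphReportSketch and PositiveSiteSketch are placeholders, the real
PositiveSite has more fields, and what graph produces here (one
character per alignment column, '*' at selected sites) is a guess
made for illustration.

```ruby
# Hypothetical sketch; the real PositiveSite has more fields.
PositiveSiteSketch = Struct.new(:position, :probability)

class GraphReportSketch
  # Plain Array of PositiveSiteSketch objects.
  attr_reader :positive_sites

  def initialize(sites, length)
    @positive_sites = sites
    @length = length
  end

  # Assumed behaviour: one character per column, '*' at selected sites.
  def graph
    line = ' ' * @length
    @positive_sites.each { |site| line[site.position - 1] = '*' }
    line
  end
end

sites = [PositiveSiteSketch.new(2, 0.99), PositiveSiteSketch.new(5, 0.96)]
puts GraphReportSketch.new(sites, 6).graph.inspect
```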

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl  Tue Jan  5 05:32:12 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 11:32:12 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105103212.GA4584@thebird.nl>

Hi Naohisa,

First I thought you were kidding. But then I realise you are serious.

I don't think we need to document every simple class variable/accessor
to accept this source code. That is overkill. If you don't understand
lnL or alpha, don't use it. We are not in the business of documenting
for documenting's sake.  Documenting lnL and alpha will be like:

"Retrieve the lnL value from the Report" 

"Retrieve the alpha value from the Report" 

etc. etc. I don't think we should be doing that. Standard 1-to-1
relations are obvious and don't need lots of text in the code base.

If someone feels like filling in these obvious statements, fine. It
really goes against my grain. Do we document every single accessor?
Note the previous implementation did no such thing. That code was
accepted fine (and partially written by you).

> Details of +buf+ (class, contents, etc) should also be written in RDoc.
> It is recommended to use the style written in the README_DEV.rdoc, or
> the style used in the Ruby source code.

You mean the contents of the input buffer, which is the content of the
input file? I see many places in Bioruby where no such a thing is
done.  Why become strict on this now? If you want a different
descriptive name for the variable - that is fine. Propose me
a better name.

> >      def to_a
> >        [ @position, @aaref, @probability, @omega ]
> >      end
> What is the purpose of the method?

Access converter. Convenience, really. You can remove it if you
dislike it so much. I use it for testing and to write to a file. Could
be to_s too, but that fixates the format.

> >    class PositiveSites < Array
> 
> To inherit Array and to create original container class is discouraged.
> In BioRuby, we have deprecated Bio::Features and Bio::References in
> version 1.3.0, although they do not inherit Array but have an array
> in the object. (The classes still exist only for backward compatibility,
> in lib/bio/compat/features.rb and references.rb).

PositiveSites object has all the features of a list (ie Array). I
think inheritance is what it should be. It is an is_a relationship.
Adding a @list will just add code. Not only for initialization, but
also for iterators. I only see how we can move backwards from readable
code. Nor is it good OOP practice. Inheritance is not *always* bad,
though I agree it is used too quickly (in general).
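
For comparison, a sketch of the inheritance form argued for here. The
names are placeholders, not the code in the PAML branch, and the
behaviour of graph (one character per column, '*' at selected sites)
is assumed for illustration.

```ruby
# Hypothetical sketch; not the actual implementation in the branch.
SiteSketch = Struct.new(:position, :probability)

# Inheriting from Array gives each, size, [] and so on without extra code.
class PositiveSitesSketch < Array
  def initialize(sites, length)
    super(sites)
    @length = length
  end

  # Assumed behaviour: one character per column, '*' at selected sites.
  def graph
    line = ' ' * @length
    each { |site| line[site.position - 1] = '*' }
    line
  end
end

sites = PositiveSitesSketch.new([SiteSketch.new(2, 0.99),
                                 SiteSketch.new(5, 0.96)], 6)
puts sites.size
puts sites.graph.inspect
```

One thing to keep in mind with either design: Array methods such as
select and map return a plain Array, so graph is not available on
their results.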

> In this case, except initialize, only a method named "graph" is added.
> I think it is good to add the graph method in the Report class and
> using an Array for storing PositiveSite objects.

This is awful. The graph is a feature of PositiveSites, and not of the
report *parser*. To keep things simple it is best practise to have
functionality where it belongs. It is good OOP design. Your proposal
means the Report class becomes less obvious in what it is. Look how
clean it is now!

What do other people on this list think? I am at a disadvantage here.

I would like this code accepted in Bioruby, so other people can use
it. I disagree with most of above 'criticism'. I certainly balk at the
last non-OOP ones. This is not the first time I am really unhappy. I
can't believe how much trouble I have to go to for a simple class,
which, as it happens, has a perfectly acceptable implementation by
most measures.

Pj.


From jan.aerts at gmail.com  Tue Jan  5 06:53:53 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Tue, 5 Jan 2010 11:53:53 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
Message-ID: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>

All,

It appears that the pre-hackathon bioruby meeting will be very useful :-)
Why don't we use that time to focus on the bit-more-distant future of
bioruby: bioruby 2.0? We could discuss what it should look like without
having to worry about backward compatibility. Topics:
* documentation style (I happen to agree with Naohisa on that)
* class hierarchy: how would we organize the information if we had to start
from scratch? (maybe we should follow bioperl's lead with a Root class?)
* coding style
* general interface decisions
* ...

jan.

PS: Still don't know if I can make it to Japan. Will know this afternoon
(broken foot might interfere...)



From pjotr.public14 at thebird.nl  Tue Jan  5 07:39:02 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 13:39:02 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100105123902.GA10823@thebird.nl>

I propose to write an ALN output writer. ALN files show aligned
sequences with additional lines of information (like a match line). I
want to use it to output PAML positive selection sites. This is
the idea:


SEQ1  alignment 1...
SEQ2  alignment 2...
      ...*.:*....***  (match line)
      ...*....*.....  (pos. sel. line)

Do we want such ALN output (I think it is allowed), and can we allow
for the additional output? I have a proposed interface here:
 
  http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb

I notice ClustalW.to_fasta has been obsoleted. But we don't have
to_aln yet, and we need to allow adding match_lines and other
information.
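
To make the idea concrete, a small sketch of the layout only. The
helper aln_block is hypothetical and is not the interface proposed in
the commit above; it also leaves out the CLUSTAL header line and the
block wrapping of real ALN files. The sequences and annotation lines
are made-up sample data.

```ruby
# Hypothetical helper; prints names, sequences and extra annotation lines.
def aln_block(seqs, extra_lines = [])
  # Pad names so that all sequences start in the same column.
  width = seqs.keys.map { |name| name.length }.max + 2
  lines = seqs.map { |name, seq| name.ljust(width) + seq }
  # Extra lines (match line, positive selection line) get no name.
  extra_lines.each { |extra| lines << (' ' * width) + extra }
  lines.join("\n")
end

puts aln_block({ "SEQ1" => "MKV-LA", "SEQ2" => "MKVTLA" },
               ["***.**", " *    "])
```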

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 08:20:24 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 22:20:24 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
Message-ID: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

On Tue, 5 Jan 2010 11:32:12 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> Hi Naohisa,
> 
> First I thought you were kidding. But then I realise you are serious.
> 
> I don't think we need to document every simple class variable/accessor
> to accept this source code. That is overkill. If you don't understand
> lnL or alpha, don't use it. We are not in the business of documenting
> for documenting's sake.  Documenting lnL and alpha will be like:
> 
> "Retrieve the lnL value from the Report" 
> 
> "Retrieve the alpha value from the Report" 
> 
> etc. etc. I don't think we should be doing that. Standard 1-to-1
> relations are obvious and don't need lots of text in the code base.

Even just one word is OK, e.g. "lnL", "alpha".
But having no RDoc at all is not allowed.

Ideally, it would be great if an informative description
could help people unfamiliar with Codeml, and this might encourage
people to begin using Codeml with BioRuby. I understand this
cannot be easily achieved. When writing a new class or adding
a large amount of code, it is also fine to implement first with minimal
documentation and to improve the documentation gradually later.

> If someone feels like filling in these obvious statements, fine. It
> really goes against my grain. Do we document every single accessor?
> Note the previous implementation did no such thing. That code was
> accepted fine (and partially written by you).

In late 2005, we decided that all methods, attributes, classes,
modules, etc. should be documented with RDoc. Code written before
early 2006 may have no RDoc. I am gradually adding RDoc to such
code, but this work is not finished yet.

> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
> > It is recommended to use the style written in the README_DEV.rdoc, or
> > the style used in the Ruby source code.
> 
> You mean the contents of the input buffer, which is the content of the
> input file? I see many places in Bioruby where no such a thing is
> done.  Why become strict on this now? If you want a different
> descriptive name for the variable - that is fine. Propose me
> a better name.

No need to change the variable name. I mean that I want to make it
clear that it refers to the contents of the file and not the filename.
If you think the current description is clear enough, it is OK.

> > >      def to_a
> > >        [ @position, @aaref, @probability, @omega ]
> > >      end
> > What is the purpose of the method?
> 
> Access converter. Convenience, really. You can remove it if you
> dislike it so much. I use it for testing and to write to a file. Could
> be to_s too, but that fixates the format.

OK, if you find it useful.

> > >    class PositiveSites < Array
> > 
> > To inherit Array and to create original container class is discouraged.
> > In BioRuby, we have deprecated Bio::Features and Bio::References in
> > version 1.3.0, although they do not inherit Array but have an array
> > in the object. (The classes still exist only for backward compatibility,
> > in lib/bio/compat/features.rb and references.rb).
> 
> PositiveSites object has the all the features of a list (ie Array). I
> think inheritance is what it should be. It is an is_a relationship.
> Adding a @list will just add code. Not only for initialization, but
> also for iterators. I only see how we can move backwards from readable
> code. Nor is it good OOP practice. Inheritance is not *always* bad,
> though I agree it is used too quickly (in general).
> 
> > In this case, except initialize, only a method named "graph" is added.
> > I think it is good to add the graph method in the Report class and
> > using an Array for storing PositiveSite objects.
> 
> This is awful. The graph is a feature of PositiveSites, and not of the
> report *parser*. To keep things simple it is best practise to have
> functionality where it belongs. It is good OOP design. Your proposal
> means the Report class becomes less obvious in what it is. Look how
> clean it is now!

I respect your design if the class is not only a container of
PositiveSite objects but also has methods that do special things
using relations among two or more objects, i.e. things that are not
a simple accumulation of each object's information.
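
For comparison, the alternative to inheriting from Array discussed in
this thread can be sketched as follows: the container holds an Array
and exposes only selected list operations. This is a minimal sketch,
not the code from the branch; the Struct fields follow the to_a method
quoted above, and the graph method here is a simplified, hypothetical
stand-in.

```ruby
require 'forwardable'

# Hypothetical value object; field names follow the quoted to_a method.
PositiveSite = Struct.new(:position, :aaref, :probability, :omega)

# Container that holds an Array instead of inheriting from it.
class PositiveSites
  include Enumerable
  extend Forwardable

  # Delegate only the list operations we want to expose.
  def_delegators :@sites, :size, :[], :<<, :empty?

  def initialize(sites = [])
    @sites = sites
  end

  # Required by Enumerable; yields each PositiveSite.
  def each(&block)
    @sites.each(&block)
  end

  # A method that needs the whole collection: one character per alignment
  # column, '*' where a site was reported and ' ' elsewhere.
  def graph(length)
    line = ' ' * length
    each { |site| line[site.position - 1] = '*' }
    line
  end
end

sites = PositiveSites.new
sites << PositiveSite.new(2, 'K', 0.99, 3.1)
sites << PositiveSite.new(5, 'R', 0.96, 2.8)
puts sites.graph(6)
puts sites.map(&:aaref).join(',')
```

The cost is the extra delegation and each boilerplate; the benefit is
that Array methods which make no sense for the container are not
exposed by accident.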

> What do other people think on this list. I am at a disadvantage here.
>
> I would like this code accepted in Bioruby, so other people can use
> it. I disagree with most of above 'criticism'. I certainly balk at the
> last non-OOP ones. This is not the first time I am really unhappy. I
> can't believe how much trouble I have to go to for a simple class,
> which, as it happens, has a perfectly acceptable implementation by
> most measures.
> 
> Pj.
> 

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 08:28:28 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 22:28:28 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105123902.GA10823@thebird.nl>
References: <20100105123902.GA10823@thebird.nl>
Message-ID: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

There is already a Bio::Alignment#output_clustal method.
It is implemented in the Bio::Alignment::Output module.

http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 5 Jan 2010 13:39:02 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> I propose to write an ALN output writer. ALN files show aligned
> sequences with additional lines of information (like a match line). I
> want to use it to output PAML positive selection sites. This is
> the idea:
> 
> 
> SEQ1  alignment 1...
> SEQ2  alignment 2...
>       ...*.:*....***  (match line)
>       ...*....*.....  (pos. sel. line)
> 
> Do we want such ALN output (I think it is allowed), and can we allow
> for the additional output. I have a proposed interface here:
>  
>   http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb
> 
> I notice ClustalW.to_fasta has been obsoleted. But we don't have
> to_aln yet, and we need to allow adding match_lines and other
> information.
> 
> Pj.
> 

From pjotr.public14 at thebird.nl  Tue Jan  5 12:04:34 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 18:04:34 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105170434.GB13498@thebird.nl>

Hi Naohisa,

Thanks for clarifying. I am happy now.

Pj.


From pjotr.public14 at thebird.nl  Tue Jan  5 12:09:25 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 18:09:25 +0100
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105170925.GA13828@thebird.nl>

On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote:
> Hi Pjotr,
> 
> There is already Bio::Alignment#output_clustal method.
> It is implemented in Bio::Alignment::Output module.
> 
> http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092

I missed that. Still it has no functionality for adding the
match_line, nor for adding extra information lines. Can I modify this
to give this method an optional parameter (list of String) for this?

The Alignment class is not aware of 'imported' match lines (it is Clustal
specific in Bioruby at this stage). 

How do you suppose we can do this so I can generate the ALN with
multiple match lines?

Pj.

From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 22:31:25 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 6 Jan 2010 12:31:25 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105170925.GA13828@thebird.nl>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105170925.GA13828@thebird.nl>
Message-ID: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 5 Jan 2010 18:09:25 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote:
> > Hi Pjotr,
> > 
> > There is already Bio::Alignment#output_clustal method.
> > It is implemented in Bio::Alignment::Output module.
> > 
> > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092
> 
> I missed that. Still it has no functionality for adding the
> match_line, nor for adding extra information lines. Can I modify this
> to give this method an optional parameter (list of String) for this?
>
> The Alignment class is not aware of 'imported' match lines (it is Clustal
> specific in Bioruby at this stage). 

The output_clustal method takes a Hash argument named "options".
The match line can be replaced with any given string via the
:match_line option.

  alignment.output_clustal(:match_line => str)

I'm very sorry for the incomplete documentation. The code was first
written in 2003, and documentation was added after 2005 but is still
incomplete.

The Bio::Alignment#match_line method calculates the match line
with the same algorithm as ClustalW.
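
To make the idea of a match line concrete, here is a deliberately
simplified sketch that marks only fully conserved columns with '*'.
It is not the Bio::Alignment#match_line implementation, which follows
ClustalW and also emits ':' and '.' for similar residues; the function
name simple_match_line is made up for this example.

```ruby
# Simplified match line: '*' for columns where every sequence carries the
# same non-gap residue, ' ' otherwise (no ':' or '.' similarity classes).
def simple_match_line(seqs)
  length = seqs.map(&:length).max || 0
  (0...length).map { |i|
    column = seqs.map { |s| s[i] }
    conserved = column.uniq.size == 1 &&
                !column.first.nil? && column.first != '-'
    conserved ? '*' : ' '
  }.join
end

seqs = ['FKNVFTVIK', 'FKNVMSVIK']
puts seqs
puts simple_match_line(seqs)
```

A string built this way (or any other string of the right length) is
the kind of value that would be passed as the :match_line option.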

> How do you suppose we can do this so I can generate the ALN with
> multiple match lines?

I'm afraid this would not be regarded as Clustal format.
Of course, it is technically easy to add such a function.

There may be many private extensions of the Clustal format.
I think this is OK because the Clustal format is loosely defined,
although that makes it hard to validate.


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl  Wed Jan  6 03:07:10 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 6 Jan 2010 09:07:10 +0100
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105170925.GA13828@thebird.nl>
	<20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100106080710.GA23141@thebird.nl>

On Wed, Jan 06, 2010 at 12:31:25PM +0900, Naohisa GOTO wrote:
> > How do you suppose we can do this so I can generate the ALN with
> > multiple match lines?
> 
> I'm afraid this is not regarded as Clustal format.
> Of course, it is technically easy to add such function.
> 
> There may be many private extensions of Clustal format.
> I think this is OK because Clustal format is rough,
> although this makes hard to validate Clustal format.

Standards are vague. EMBOSS does not even mention the match line, but
as ClustalW generates it we assume it is a 'standard'. I think most
parsers basically ignore lines starting with white space. So multiple
'match lines' should normally work. Many standards in bioinformatics
evolve from use - maybe my idea will become a standard one day ;-).

I think it is a nice feature to have. I'll add a warning that one
should use it with caution.
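
A rough sketch of the idea, assuming a writer that appends any number
of annotation lines and a reader that skips every line starting with
white space. Both functions (write_aln, read_aln) and the header text
are hypothetical, not existing BioRuby methods, and real ALN files are
split into multiple blocks, which this sketch ignores.

```ruby
# Writes one Clustal-like block: header, sequences, then any number of
# annotation lines indented to the sequence column.
def write_aln(names, seqs, annotation_lines)
  width = names.map(&:length).max + 6
  out = ['CLUSTAL W (sketch) multiple sequence alignment', '', '']
  names.zip(seqs) { |name, seq| out << name.ljust(width) + seq }
  annotation_lines.each { |line| out << (' ' * width) + line }
  out.join("\n") + "\n"
end

# Minimal reader: drops the header, blank lines, and every line that
# starts with white space; keeps only "name sequence" lines.
def read_aln(text)
  seqs = Hash.new { |h, k| h[k] = String.new }
  text.each_line.drop(1).each do |line|
    next if line.strip.empty? || line =~ /\A\s/
    name, seq = line.split
    seqs[name] << seq
  end
  seqs
end

aln = write_aln(%w[SEQ1 SEQ2],
                ['FKNVFTVIK', 'FKNVMSVIK'],
                ['****  ***',    # match line
                 '    *    '])   # pos. sel. line
puts aln
parsed = read_aln(aln)
puts parsed.inspect
```

A reader written like this returns the same sequences whether the file
has zero, one, or several annotation lines, which is why the extra
lines should normally be harmless.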

BTW the ALN-writer should really live in its own class/module, similar
to the current layout for the 'Report' class (which in reality is an
ALN parser, or ALN-reader). It is no surprise I did not find either of
them when I was looking for an implementation.

OK, I'll cook something up in a separate git branch.

Pj.

From mail at michaelbarton.me.uk  Wed Jan  6 11:58:01 2010
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Wed, 6 Jan 2010 11:58:01 -0500
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> 
	<20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> 
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
Message-ID: <c27b73c1001060858u4598b22an14ee320a46ec1a0@mail.gmail.com>

2010/1/5 Jan Aerts <jan.aerts at gmail.com>:
> It appears that the pre-hackathon bioruby meeting will be very useful :-)
> Why don't we use that time to focus on the bit-more-distant future of
> bioruby: bioruby 2.0? We could discuss what it should look like without
> having to worry about backward compatibility.

I second what Jan has suggested about the direction of BioRuby and
version 2.0. As Ruby becomes a more popular programming language in
bioinformatics, BioRuby can be expected to receive more and more
contributions. The run-up to BioRuby 2.0 might be a good time to
discuss how BioRuby will grow and be organised as it increases in
size.

Topics:
> * documentation style (I happen to agree with Naohisa on that)
> * class hierarchy: how would we organize the information if we had to start
> from scratch? (maybe we should follow bioperl's lead with a Root class?)
> * coding style
> * general interface decisions
> * ...
>
> jan.
>
> PS: Still don't know if I can make it to Japan. Will know this afternoon
> (broken foot might interfere...)
>
> 2010/1/5 Pjotr Prins <pjotr.public14 at thebird.nl>
>
>> Hi Naohisa,
>>
>> First I thought you were kidding. But then I realise you are serious.
>>
>> I don't think we need to document every simple class variable/accessor
>> to accept this source code. That is overkill. If you don't understand
>> lnL or alpha, don't use it. We are not in the business of documenting
>> for documenting's sake.  Documenting lnL and alpha will be like:
>>
>> "Retrieve the lnL value from the Report"
>>
>> "Retrieve the alpha value from the Report"
>>
>> etc. etc. I don't think we should be doing that. Standard 1-to-1
>> relations are obvious and don't need lots of text in the code base.
>>
>> If someone feels like filling in these obvious statements, fine. It
>> really goes against my grain. Do we document every single accessor?
>> Note the previous implementation did no such thing. That code was
>> accepted fine (and partially written by you).
>>
>> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
>> > It is recommended to use the style written in the README_DEV.rdoc, or
>> > the style used in the Ruby source code.
>>
>> You mean the contents of the input buffer, which is the content of the
>> input file? I see many places in Bioruby where no such a thing is
>> done.  Why become strict on this now? If you want a different
>> descriptive name for the variable - that is fine. Propose me
>> a better name.
>>
>> > >      def to_a
>> > >        [ @position, @aaref, @probability, @omega ]
>> > >      end
>> > What is the purpose of the method?
>>
>> Access converter. Convenience, really. You can remove it if you
>> dislike it so much. I use it for testing and to write to a file. Could
>> be to_s too, but that fixates the format.
>>
>> > >    class PositiveSites < Array
>> >
>> > To inherit Array and to create original container class is discouraged.
>> > In BioRuby, we have deprecated Bio::Features and Bio::References in
>> > version 1.3.0, although they do not inherit Array but have an array
>> > in the object. (The classes still exist only for backward compatibility,
>> > in lib/bio/compat/features.rb and references.rb).
>>
>> PositiveSites object has the all the features of a list (ie Array). I
>> think inheritance is what it should be. It is an is_a relationship.
>> Adding a @list will just add code. Not only for initialization, but
>> also for iterators. I only see how we can move backwards from readable
>> code. Nor is it good OOP practice. Inheritance is not *always* bad,
>> though I agree it is used too quickly (in general).
>>
>> > In this case, except initialize, only a method named "graph" is added.
>> > I think it is good to add the graph method in the Report class and
>> > using an Array for storing PositiveSite objects.
>>
>> This is awful. The graph is a feature of PositiveSites, and not of the
>> report *parser*. To keep things simple it is best practise to have
>> functionality where it belongs. It is good OOP design. Your proposal
>> means the Report class becomes less obvious in what it is. Look how
>> clean it is now!
>>
>> What do other people think on this list. I am at a disadvantage here.
>>
>> I would like this code accepted in Bioruby, so other people can use
>> it. I disagree with most of above 'criticism'. I certainly balk at the
>> last non-OOP ones. This is not the first time I am really unhappy. I
>> can't believe how much trouble I have to go to for a simple class,
>> which, as it happens, has a perfectly acceptable implementation by
>> most measures.
>>
>> Pj.
>>
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>


From jan.aerts at gmail.com  Fri Jan  8 11:29:07 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 8 Jan 2010 16:29:07 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
Message-ID: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>

Maybe it'd be a good idea to start thinking at a level removed from actual
code, and create some general design documents first. Maybe we should
* describe what we actually want to achieve with the bioruby toolkit: should
it be a library foremost, or should it rather be an interface to run other
programs (e.g. BLAST)?
* make a high-level overview of different parts of bioruby:
  - how do we handle file formats: are the files actual objects, or do they
merely describe a biological entity? E.g. does a FASTA file merit the
instantiation of a FASTA object, or is it nothing more than a container of
Sequence objects?
  - how do different parts of the library interact? Should we have a Root
class such as in bioperl? What type of class should be used to interface
with the world (e.g. file parsing)? What type of class should be used to
actually contain the object data (e.g. annotated sequence)?

When that's done: come up with general guidelines for coding, e.g. always
use keyword-based argument lists or something (just an example).

jan.

2010/1/5 Jan Aerts <jan.aerts at gmail.com>

> All,
>
> It appears that the pre-hackathon bioruby meeting will be very useful :-)
> Why don't we use that time to focus on the bit-more-distant future of
> bioruby: bioruby 2.0? We could discuss what it should look like without
> having to worry about backward compatibility. Topics:
> * documentation style (I happen to agree with Naohisa on that)
> * class hierarchy: how would we organize the information if we had to start
> from scratch? (maybe we should follow bioperl's lead with a Root class?)
> * coding style
> * general interface decisions
> * ...
>
> jan.
>
> PS: Still don't know if I can make it to Japan. Will know this afternoon
> (broken foot might interfere...)


From pjotr.public14 at thebird.nl  Fri Jan  8 12:21:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 8 Jan 2010 18:21:32 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
	<4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
Message-ID: <20100108172132.GA28895@thebird.nl>

On Fri, Jan 08, 2010 at 04:29:07PM +0000, Jan Aerts wrote:
> Maybe it'd be a good idea to start thinking at a level removed from actual
> code, and create some general design documents first. Maybe we should
> * describe what we actually want to achieve with the bioruby toolkit: should
> it be a library foremost, or should it rather be an interface to run other
> programs (e.g. BLAST)?

I think calling into other programs is a good feature, but should be
really split out. Likewise for web services. Both split in terms of
objects and directory layout. Currently the functionality is too
intertwined.

Then there is support for reading and writing standard formats.

Then there is extra functionality (not found elsewhere, perhaps).

And we have Rails support and the shell.

All these should be clearly split out.

I don't think we have to choose. We can have it all. Just make sure
it sits in the right location.

> * make a high-level overview of different parts of bioruby:
>   - how do we handle file formats: are the files actual objects, or do they
> merely describe a biological entity? E.g. does a FASTA file merit the
> instantiation of a FASTA object, or is it nothing more than a container of
> Sequence objects?
>   - how do different parts of the library interact? Should we have a Root
> class such as in bioperl? What type of class should be used to interface
> with the world (e.g. file parsing)? What type of class should be used to
> actually contain the object data (e.g. annotated sequence)?
> 
> When that's done: come up with general guidelines for coding, e.g. always
> use keyword-based argument lists or something (just an example).

These choices are design choices and have to originate in a list of
shared 'values'. Because if we don't agree on a value there will
always be arguments and disagreement. One value would be 'clear
documentation', but this may collide with 'clear source code'.
Similarly 'Easy to use code' and 'Concise code' may collide. Or
functional choices over OOP. We need to put those values together and
rank them in importance. Once the ranking is set we can make easy
choices in guidelines.

I am writing a type of Manifest. I'll present that in the coming
weeks, when I feel I am ready. It is meant for discussion in Japan,
and after.

Pj.

From pjotr.public14 at thebird.nl  Mon Jan 11 09:40:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 11 Jan 2010 15:40:41 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100111144041.GA31684@thebird.nl>

I have created a colorized HTML alignment file with consensus
information and amino acids showing evidence of positive selection
(based on PAML output).

  http://thebird.nl/projects/test_color2.html

I did a write up on the implementation at:

  http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR

Enjoy,

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan 12 04:29:57 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 12 Jan 2010 18:29:57 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100111144041.GA31684@thebird.nl>
References: <20100111144041.GA31684@thebird.nl>
Message-ID: <20100112092957.A16001CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I'm not sure whether the prefix Bio::Html is suitable or not.

By the way, I've tried some of your code in
http://github.com/pjotrp/bioruby/blob/color-alignment/
and found potential XSS.

  a = Bio::Alignment.new
  a.add_seq('ATCCATGG', '<script>alert("a");</script>')
  a.add_seq('ATGCATGC', '<script>alert("b");</script>')
  a.add_seq('<script>alert("c");</script>', 'c')
  simple = Bio::Html::HtmlAlignment.new(a,
          :title => '<script>alert("title");</script>')
  html = simple.html()
  File.open('/tmp/xss.html', 'w') { |w| w.print html }

For sequences, sequence names, and consensus lines,
CGI.escapeHTML() will always be needed.

For the :title, if script users can set the title, it
should be escaped, but this prevents script programmers
from using HTML tags in the title.
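
A minimal sketch of what this could look like, using only
CGI.escapeHTML from the standard library. The helper names html_row
and html_title and the trusted_html flag are hypothetical and are not
part of Bio::Html::HtmlAlignment; the flag is just one possible way to
keep HTML titles available to script programmers.

```ruby
require 'cgi'

# Renders one alignment row, always escaping the sequence name and the
# sequence itself.
def html_row(name, seq)
  "<tr><td>#{CGI.escapeHTML(name)}</td>" \
    "<td><tt>#{CGI.escapeHTML(seq)}</tt></td></tr>"
end

# The title is escaped unless the caller explicitly marks it as trusted
# HTML (an assumed opt-in, not an existing option).
def html_title(title, trusted_html = false)
  "<h1>#{trusted_html ? title : CGI.escapeHTML(title)}</h1>"
end

puts html_row('<script>alert("a");</script>', 'ATCCATGG')
puts html_title('a<b>c')
puts html_title('<em>PAML</em> sites', true)
```

With escaping as the default, names such as a<b>c are displayed
literally, and only a caller who opts in can emit raw tags.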

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 11 Jan 2010 15:40:41 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> I have created an colorized HTML alignment file with consensus
> information and amino acids showing evidence of positive selection
> (based on PAML output).
> 
>   http://thebird.nl/projects/test_color2.html
> 
> I did a write up on the implementation at:
> 
>   http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR
> 
> Enjoy,
> 
> Pj.
> 
> 
> 
> 


From pjotr.public14 at thebird.nl  Tue Jan 12 05:11:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 12 Jan 2010 11:11:32 +0100
Subject: [BioRuby] Bioruby HTML output
Message-ID: <20100112101132.GC10308@thebird.nl>

On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> I'm not sure whether the prefix Bio::Html is suitable or not.

Me neither ;). This is something to discuss when we meet. See my
write up on partitioning based on functionality or standards.

> By the way, I'v tried some of your code in
> http://github.com/pjotrp/bioruby/blob/color-alignment/
> and found potential XSS.
> 
>   a = Bio::Alignment.new
>   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
>   a.add_seq('ATGCATGC', '<script>alert("b");</script>')
>   a.add_seq('<script>alert("c");</script>', 'c')
>   simple = Bio::Html::HtmlAlignment.new(a,
>           :title => '<script>alert("title");</script>')
>   html = simple.html()
>   File.open('/tmp/xss.html', 'w') { |w| w.print html }
> 
> For sequences, sequence names, and consensus lines,
> using CGI.escapeHTML() will always be needed.
>
> For the :title, if script users can set the title, it
> should be escaped, but this prevents script programmers
> using html tags in the title.

Perhaps the HTML generator should escape its output. Though I
personally think we should only be worried about security concerns
when people *enter* new data on input forms. That is when exploits
show up. I can argue that HTML generation should not concern itself
with HOW the inputs are presented. One advantage of having a
programmer set the 'title' is that he *can* embed HTML. Perhaps
escaping HTML is the responsibility of the programmer providing the
data, and therefore of the logic that handles input.

We have had a similar discussion before. We have to decide to what
level *output* code should concern itself with *input* security. I
have a feeling that too many BioRuby classes try to do too much.
How do we stay away from cluttering the code? How do we decide that
callers should not use HTML and handle security concerns?

You write:

>   a.add_seq('ATCCATGG', '<script>alert("a");</script>')

If a programmer wants that - it is his concern in my opinion. If he is
concerned about exploits he should not allow it. The Alignment class
does not care either. It is none of its business.

BTW I fixed a number of PAML::Codeml bugs on this branch. So you
can ignore the existing PAML branch. Let's continue with the color
coding, assuming you can live with the PAML::Codeml implementation,
as it stands.

Pj.


From donttrustben at gmail.com  Tue Jan 12 07:52:42 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 12 Jan 2010 22:52:42 +1000
Subject: [BioRuby] SPTR problem
Message-ID: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>

Hi,

While parsing all the yeast UniProt txt files I came across a problem with
the gn parser - it was returning an array when I expected a hash. Looking at
the code the problem seems to be this when statement:

      when /Name=/,/ORFNames=/
        @data['GN'] = gn_uniprot_parser
      else
        @data['GN'] = gn_old_parser
      end

http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:

GN OrderedLocusNames=YMR084W;

So the GN line has OrderedLocusNames= but neither Name= nor ORFNames=, so it
didn't use the new parser, unlike the other entries I came across. Should all 4
possibilities be tested for in the when statement (Synonyms= being the
4th)?
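
Something like the following is what I have in mind, written as a
standalone sketch rather than a patch: the constant and method names
are made up, and the symbols only stand in for the two existing parser
methods.

```ruby
# Tokens that mark the newer "key=value;" style of GN line.
GN_NEW_STYLE = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

# Returns which parser would be chosen for the text of a GN line.
# (The symbols stand in for gn_uniprot_parser / gn_old_parser.)
def gn_parser_for(gn_text)
  case gn_text
  when GN_NEW_STYLE then :gn_uniprot_parser
  else :gn_old_parser
  end
end

puts gn_parser_for('OrderedLocusNames=YMR084W;')
puts gn_parser_for('Name=CDC28; Synonyms=CDK1;')
puts gn_parser_for('CDC28 OR CDK1.')
```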

Also, while I'm here:
* why does the returned hash have different keys than are in the file? e.g.
ORFNames becomes :orfs?
* I also found the parsing process for whole genomes quite slow (multiple
hours for well annotated ones).
* is there any standard way to handle concatenated UniProt files? I wrote my
own as it was simple.

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same
place.

From ngoto at gen-info.osaka-u.ac.jp  Tue Jan 12 21:58:00 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 13 Jan 2010 11:58:00 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100112101132.GC10308@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
Message-ID: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 12 Jan 2010 11:11:32 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> > I'm not sure whether the prefix Bio::Html is suitable or not.
> 
> Me neither ;). This is something to discuss when we meet. See my
> write up on partitioning based on functionality or standards.
> 
> > By the way, I've tried some of your code in
> > http://github.com/pjotrp/bioruby/blob/color-alignment/
> > and found potential XSS.
> > 
> >   a = Bio::Alignment.new
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> >   a.add_seq('ATGCATGC', '<script>alert("b");</script>')
> >   a.add_seq('<script>alert("c");</script>', 'c')
> >   simple = Bio::Html::HtmlAlignment.new(a,
> >           :title => '<script>alert("title");</script>')
> >   html = simple.html()
> >   File.open('/tmp/xss.html', 'w') { |w| w.print html }
> > 
> > For sequences, sequence names, and consensus lines,
> > using CGI.escapeHTML() will always be needed.
> >
> > For the :title, if script users can set the title, it
> > should be escaped, but this prevents script programmers
> > using html tags in the title.
> 
> Perhaps the HTML generator should escape its output. Though I
> personally think we should only be worried about security concerns
> when people *enter* new data on input forms. That is when exploits
> show up. I can argue that HTML generation should not concern itself
> with HOW the inputs are presented. One advantage of having a
> programmer set the 'title' is that he *can* embed HTML. Perhaps
> escaping HTML is the responsibility of the programmer providing the
> data. And therefore to the logic that handles input.

Even apart from security, sequence names (and sequences) that
contain html special characters may not be correctly displayed.

For example, take sequence names built from three parameters a, b, and c.

% cat test.aln
CLUSTAL 2.0.9 multiple sequence alignment


1<a<3_b>5_c<7       FKNVFTVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKWFLQ
a>3_5<b<8_c>11      FKNVMSVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKTFLQ
                    ****::*********************************** ***
% irb -r bio
irb> report = Bio::ClustalW::Report.new(File.read('test.aln'))
irb> alignment = report.alignment
irb> simple = Bio::Html::HtmlAlignment.new(alignment, :title => 'a,b,c')
irb> File.open('abc.html', 'w') { |w| w.print simple.html() }

ClustalW 2.0.9 handled the sequence names correctly, but the generated
HTML does not display them as expected.

This problem cannot be solved by escaping the input data.
If the sequence name "1<a<3_b>5_c<7" is escaped to
"1&lt;a&lt;3_b&gt;5_c&lt;7" before calling the method,
the text indentation breaks, because the string length no longer
matches the width displayed in HTML. To solve this, the escaping
has to be done by the output formatting method while it builds
the HTML.
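
A minimal sketch of that ordering, assuming a hypothetical helper
format_row (it is not part of the branch under discussion): the padding
is computed from the unescaped name, and CGI.escapeHTML is applied only
when the output line is assembled.

```ruby
require 'cgi'

# Hypothetical helper: pad according to the displayed (unescaped) width,
# then escape each piece while building the HTML line.
def format_row(name, seq, width = 20)
  padding = ' ' * [width - name.length, 0].max
  CGI.escapeHTML(name) + padding + CGI.escapeHTML(seq)
end

name = '1<a<3_b>5_c<7'
puts format_row(name, 'FKNVFTVIKT')

# Escaping before the call changes the length used for padding:
puts name.length                   # 13 characters as displayed
puts CGI.escapeHTML(name).length   # 25 characters once escaped
```

With the name escaped beforehand, a padding step would see 25 characters
instead of 13, which is the indentation mismatch described above.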

> We have had a similar discussion before. We have to decide to what
> level *output* code should concern itself with *input* security. I
> have a feeling that too much of Bioruby classes try to do too much.
> How do we stay away from cluttering the code? How do we decide that
> callers should not use HTML and handle security concerns?

It is difficult to avoid HTML-like strings that we want treated
as plain unformatted text but that some programs unexpectedly
treat as HTML, as in the example above.

For security, I'd like to ask security experts.
Is there anyone on this list?

I think escaping should be done by the formatting layer and
should be turned on by default, because:
* Only the output formatting layer knows how the input data
  is processed.
* In many cases, the data comes from outside, and we cannot
  expect it to be safe.
* Different escaping rules are needed for different output types,
  e.g. SQL, HTML, XML, TeX, GNUPLOT, PostScript, R scripts.
  Escaping in the output methods seems natural, and makes it
  possible to switch output formats without worrying about the
  escaping issues specific to each format.
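
The last point can be sketched as follows. This is only an illustration
under assumptions: the ESCAPERS table and the emit method are
hypothetical names, and shell quoting (from Ruby's standard library)
stands in for the other output types listed above.

```ruby
require 'cgi'
require 'shellwords'

# Hypothetical registry of per-format escaping rules; the format names
# and the ESCAPERS constant are illustrative, not BioRuby API.
ESCAPERS = {
  :html  => lambda { |s| CGI.escapeHTML(s) },
  :shell => lambda { |s| Shellwords.escape(s) },
  :text  => lambda { |s| s }
}

# The caller always passes raw data; the output layer picks the rule.
def emit(raw, format)
  ESCAPERS.fetch(format).call(raw)
end

name = 'a>3_5<b<8_c>11'
puts emit(name, :html)
puts emit(name, :shell)
puts emit(name, :text)
```

Because the caller never escapes anything itself, switching the output
format only means passing a different format symbol.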

> You write:
> 
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> 
> If a programmer wants that - it is his concern in my opinion. If he is
> concerned about exploits he should not allow it. The Alignment class
> does not care either. It is none of its business.

The example is an extreme case. For security, please ask experts.
Apart from security, I would like ">", "<", "&", etc. to be
displayed correctly. I think the methods that build the HTML
output should take care of this.

> BTW I fixed a number of PAML::Codeml bugs on this branch. So you
> can ignore the existing PAML branch. Let's continue with the color
> coding, assuming you can live with the PAML::Codeml implementation,
> as it stands.

When do you want the Bio::PAML::Codeml code to be merged to the
blessed bioruby repository?


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Jan 13 01:57:11 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 15:57:11 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>

Hi, Happy New Year!

> For security, I'd like to ask security experts.
> Anyone in this list?

Though I am not an expert: in a Japanese blog post,
http://takagi-hiromitsu.jp/diary/20051227.html
Hiromitsu Takagi explains, from a security point of view, why escaping
should be the default at the output point. It sounds reasonable to me,
though I do not know of any English literature on this.

In addition,

> * Different escaping rules are needed for different output types,
>   e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
>   Escaping by output methods seems natural, and helps to switch
>   output formats without concerning escaping issues specific
>   to each output format.


This is a good argument.
If a title containing HTML tags is necessary, a non-default API that
accepts HTML-marked text rather than normal text should be considered.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan



From pjotr.public14 at thebird.nl  Wed Jan 13 02:37:06 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Jan 2010 08:37:06 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100113073706.GA25611@thebird.nl>

Hi all,

OK, I'll adapt the output generator to escape symbols. And I think
you are right that it belongs in the generator. There are really
three scenarios:

1. Output that never contains symbols (sequence)
2. Output that can contain symbols, but should be escaped
   (descriptions, ids)
3. Output that can contain HTML

In my case I have all three. 

I think with a sequence we can assume the content is a legal string.
Escaping is overkill and (if needed) points to a bigger problem. I
think we should not clutter the code with (1) - or degrade performance
by default.

Case (2) yes!

Case (3), like a title or some text to plug in, we should escape by
default, but add a parameter :html_escape => false for the cases where
the user wants to plug in HTML.
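
A minimal sketch of such a switch, assuming a hypothetical render_title
method; only the option name :html_escape comes from the proposal above.

```ruby
require 'cgi'

# Hypothetical method: escaping is on unless the caller opts out.
def render_title(title, opts = {})
  escape = opts.fetch(:html_escape, true)
  text = escape ? CGI.escapeHTML(title) : title
  "<h1>#{text}</h1>"
end

puts render_title('<em>fancy</em> title')
puts render_title('<em>fancy</em> title', :html_escape => false)
```

The first call shows the tags as literal text; the second passes the
markup through unchanged, on the caller's responsibility.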

OK?

Pj.

From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Jan 13 04:44:01 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 18:44:01 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113073706.GA25611@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
Message-ID: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>

Hi,

> I think with a sequence we can assume the content is a legal string.
> Escaping is overkill and (if needed) points to a bigger problem. I
> think we should not clutter the code with (1) - or degrade performance
> by default.


If we are talking about Bio::Html::HtmlAlignment,
it is better to escape even sequences and matchlines, to make
the class more independent of the implementation of the alignment class.
Note that sim4 uses >>>...>>> in its matchline, and a future
intron-aware amino acid alignment program might use
special characters to indicate introns.

If performance really is a problem, the code is in
Bio::Alignment::Output, and the constructor guarantees
that there are no special characters, then the escaping may be skipped.
Escaping everything is the simple default program structure;
removing it is a kind of optimization that takes some programming
effort to guarantee the output stays valid without escaping.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan



From pjotr.public14 at thebird.nl  Fri Jan 15 09:00:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 15 Jan 2010 15:00:59 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100115140059.GA24948@thebird.nl>

On second thought, escaping is less obvious than I thought. I can
escape all generated HTML, but that leaves no way to customize the
output. Say I want to include an href in a sequence descriptor - which
is a fairly typical requirement - that would be disabled. Likewise if
someone wants to customize the title or footer - or even the
information on the match_line. 

The problem here is that we are defining use - forcing the generated
HTML into a straitjacket by adding business logic.

Are we really telling our users not to use HTML in sequence
descriptors, even if it is tied to one type of output?

I don't like it.

I am going to add a 'master' switch for escaping of HTML. The default
will be with escaping.

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Fri Jan 15 12:19:12 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 16 Jan 2010 02:19:12 +0900
Subject: [BioRuby] SPTR problem
In-Reply-To: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
References: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
Message-ID: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 12 Jan 2010 22:52:42 +1000
Ben Woodcroft <donttrustben at gmail.com> wrote:

> Hi,
> 
> While parsing all the yeast UniProt txt files I came across a problem with
> the gn parser - it was returning an array when I expected a hash. Looking at
> the code the problem seems to be this when statement:
> 
>       when /Name=/,/ORFNames=/
>         @data['GN'] = gn_uniprot_parser
>       else
>         @data['GN'] = gn_old_parser
>       end
> 
> http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:
> 
> GN OrderedLocusNames=YMR084W;
> 
> So the GN line has OrderedLocusNames= but neither Name= nor ORFNames=, so it
> didn't use the new parser the way the other entries I came across did.
> Should all 4 possibilities be tested for in the when statement (Synonyms=
> being the 4th)?

It seems to be a bug. Perhaps there were no (or very few) entries
which only had OrderedLocusNames= when the code was first written
in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d.

The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. 
The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
| 
| GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
| GN   ORFNames=<name1>[, <name2>...];
| 
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.
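
The four tokens in this format description could be tested with a single
pattern. A minimal sketch, assuming the hypothetical names GN_TOKEN and
gn_format; it only illustrates the dispatch and is not the code in
BioRuby.

```ruby
# Hypothetical names; the symbols stand in for the two real parsers.
GN_TOKEN = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

def gn_format(gn_text)
  case gn_text
  when GN_TOKEN then :uniprot   # would call gn_uniprot_parser
  else               :old       # would call gn_old_parser
  end
end

puts gn_format('OrderedLocusNames=YMR084W;')
puts gn_format('ORFNames=placeholder1, placeholder2;')
puts gn_format('a line without any of the four tokens')
```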

You are right that all 4 possibilities should be considered.
"Synonyms" could be left out, since it only appears together with
"Name", but it is safer to include it.

> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?

I don't know. I now think that using the same names as in the
original entries would be preferable, too.

> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).

Please use a profiler to find the bottlenecks.
 % ruby -rprofile xxx.rb

> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.

What type of "concatenated" do you mean?
For simple concatenation, for example the original file distributed
on the UniProt FTP site, Bio::FlatFile can be used.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(please gunzip before reading!)

 ff = Bio::FlatFile.open("uniprot_sprot.dat")
 ff.each do |e|
   puts e.entry_id
 end
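
For comparison, a stand-alone splitter takes only a few lines, because
each entry in a UniProt flat file ends with a line containing only "//".
This is a minimal sketch with a hypothetical method name and placeholder
data; Bio::FlatFile remains the library's own way to do this.

```ruby
require 'stringio'

# Hypothetical helper: yield one entry at a time from a concatenated
# flat file, using the "//" terminator line as the record separator.
def each_uniprot_entry(io)
  io.each_line("\n//\n") do |entry|
    yield entry unless entry.strip.empty?
  end
end

# Placeholder data standing in for a real concatenated file.
data = "ID   ENTRY_ONE\nAC   X00001;\n//\nID   ENTRY_TWO\nAC   X00002;\n//\n"
each_uniprot_entry(StringIO.new(data)) do |entry|
  puts entry[/^ID\s+(\S+)/, 1]
end
```

Reading entry by entry keeps memory use flat for whole-genome files,
whichever parser is then applied to each entry.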

> 
> Thanks,
> ben

Thank you.

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp  Sat Jan 16 00:36:02 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Sat, 16 Jan 2010 14:36:02 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100115140059.GA24948@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
Message-ID: <4B515042.7020204@kenroku.kanazawa-u.ac.jp>

Hi,
Pjotr Prins wrote:
 > On second thought, escaping is less obvious than I thought. I can
 > escape all generated HTML, but that leaves no way to customize the
 > output. Say I want to include an href in a sequence descriptor - which
 > is a fairly typical requirement - that would be disabled.

I agree with this. Having a link to the original sequence on the name
is usually a good idea.

 > I am going to add a 'master' switch for escaping of HTML. The default
 > will be with escaping.

What do you think about testing whether the object responds to to_html,
calling to_html if it does, and otherwise passing it to escapeHTML?
The object may internally hold plain text and htmlized text, or
plain text plus link information, or just the plain text, as long
as it takes care of how it is output as an HTML inline element.

If properly implemented, it can generate a link from "gi|112233|..."
within a text and cache the converted result.

The object can also simply pass through the user-supplied HTML.

I think it is a predictable use case that user-supplied sequences are
aligned with sequences obtained from databases. Isn't it better to be
able to treat user-supplied text as simple text while the sequences from
databases get proper links?  This may not be simple with a master switch.


From pjotr.public14 at thebird.nl  Sat Jan 16 03:30:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 16 Jan 2010 09:30:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <4B515042.7020204@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100116083041.GA2663@thebird.nl>

On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
> > I am going to add a 'master' switch for escaping of HTML. The default
> > will be with escaping.
>
> What do you think about testing whether the object responds to to_html,
> calling to_html if it does, and otherwise passing it to escapeHTML?

In this case the object to convert to HTML is a String and part of
Bio::Alignment. Later implementations of Bio::Alignment could use a
Bio::Sequence.id (or something Naohisa wrote me).  It would mean we
would have to create a Bio::Sequence::Descriptor object, which would
contain several specialized 'output' generators.

This is a recurrent idea we need to discuss.

I think *all* HTML based stuff should be in its own objects - and its
own tree (I have created bio/output/html for that purpose).

I think it is a bad idea to clutter regular BioRuby code with HTML
specific stuff. Likewise for other outputs, as you pointed out, like
plotting. Output should live in

  bio/lib/output/html
  bio/lib/output/plot
  bio/lib/output/gtk
  bio/lib/output/rails (perhaps)
  (etc)

that way display code never pollutes the simple Bio::Sequence object,
for example. You'll get Bio::Html::Sequence for that - or my
preferred naming Bio::HtmlSequence.

Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the
latter would not care - and we could adapt the HtmlSequence info to
show embedded hrefs. 

That would be the proper way to handle it. No testing of methods
(like to_html), but use the object structure to define what is
supported (and not).
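
A minimal sketch of that idea, using hypothetical class names
(PlainLabel, HtmlLabel) rather than anything in the branch: every
container exposes the same 'html' method, so the generator never has to
test what it was given.

```ruby
require 'cgi'

# Hypothetical container for text that must be shown literally.
class PlainLabel
  def initialize(text)
    @text = text
  end

  def html
    CGI.escapeHTML(@text)
  end
end

# Hypothetical container for markup prepared by the programmer.
class HtmlLabel
  def initialize(markup)
    @markup = markup
  end

  def html
    @markup
  end
end

labels = [PlainLabel.new('a<b'), HtmlLabel.new('<a href="#s1">s1</a>')]
labels.each { |label| puts label.html }
```

The choice between escaping and passing markup through is made once,
when the container is created, not at every output call.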

Until we implement that (get Bio::Alignment to support arbitrary
Sequence objects) I think the master switch is fine. I have updated
my branch. Default behaviour is escaping. If a user (like me) wants
it otherwise, it is allowed.

Pj.

From tomoakin at kenroku.kanazawa-u.ac.jp  Sun Jan 17 00:12:35 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Sun, 17 Jan 2010 14:12:35 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100116083041.GA2663@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
Message-ID: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>

Hi,

On 2010/01/16, at 17:30, Pjotr Prins wrote:

> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
>>> I am going to add a 'master' switch for escaping of HTML. The
>>> default will be with escaping.
>>
>> What do you think about testing whether the object responds to to_html,
>> calling to_html if it does, and otherwise passing it to escapeHTML?
>
> In this case the object to convert to HTML is a String and part of
> Bio::Alignment. Later implementations of Bio::Alignment could use a
> Bio::Sequence.id (or something Naohisa wrote me).  It would mean we
> would have to create a Bio::Sequence::Descriptor object, which would
> contain several specialistic 'output' generators.


For the time being I don't expect such a sophisticated mechanism to
generate proper HTML automatically; I would simply add a means to
distinguish what should be escaped as a matter of course from what
has been specifically prepared as HTML by the user.

A user can write:

class HTMLString < String
   def to_html
     self
   end
end

a = Bio::Alignment.new
a.add_seq('ATCCATGG',
  HTMLString.new('<a href="http://example.com/path/to/original/seqinfo"><em>a</em></a>'))
# this is html under the responsibility of the programmer

a.add_seq('ATGCATGC', '<b>')
# this is not html; don't care on '<', or '>'

simple = Bio::Html::HtmlAlignment.new(a,
   :title => HTMLString.new('A <em>fancy</em> <b>HTML</b> <i>title</i>'))
html = simple.html()

If Bio::Alignment does not force the given object to be a String,
such code should be possible without any change in Bio::Alignment,
and only the HtmlAlignment class and the programmer need to know about it.
So HTML-specific code does not need to go into regular BioRuby code.
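
On the formatter side, the check could look like the sketch below. The
helper name html_for is hypothetical, and HTMLString is repeated only to
keep the example self-contained.

```ruby
require 'cgi'

# Marker class as in the example above: the content is already HTML.
class HTMLString < String
  def to_html
    self
  end
end

# Hypothetical helper for the HTML generator: trust objects that say
# they are HTML, escape everything else.
def html_for(obj)
  obj.respond_to?(:to_html) ? obj.to_html : CGI.escapeHTML(obj.to_s)
end

puts html_for(HTMLString.new('<em>a</em>'))
puts html_for('<b>')
```

A plain String has no to_html method, so it is escaped by default; only
text explicitly wrapped by the programmer is passed through.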

> That would be the proper way to handle it. No testing of methods
> (like to_html), but use the object structure to define what is
> supported (and not).


I'm not sure what you mean by "use the object structure".
How do you distinguish between plain text and HTML text?
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan



From pjotr.public14 at thebird.nl  Sun Jan 17 08:54:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 17 Jan 2010 14:54:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100117135441.GA24341@thebird.nl>

Hi Tomoaki,

Thanks for your responses. I really appreciate it.

On Sun, Jan 17, 2010 at 02:12:35PM +0900, Tomoaki NISHIYAMA wrote:
> A user can write:
>
> class HTMLString < String
>   def to_html
>     self
>   end
> end
>
> a = Bio::Alignment.new
> a.add_seq('ATCCATGG',
>   HTMLString.new('<a href="http://example.com/path/to/original/seqinfo"><em>a</em></a>'))

There is at least one 'problem' with this approach.

This assumes that Bio::Alignment will keep its current implementation.
Currently Bio::Alignment stores a list of descriptions, and a list of
sequences. As Naohisa wrote me two weeks ago, this is before
Bio::Sequence had its own identifier/descriptor. If we redesign
Bio::Alignment there is a large chance we will store Bio::Sequence
instead of two lists (I, for one, would certainly favour that).

The other problem is more about OOP. In your example you first say it
is an HTML object (HTMLString) and then you add an HTML-specific method,
'to_html'. It is 'told' twice that it generates HTML. 'to_html'
also implies some kind of transformation. We should opt for a
different method name (generate_html, perhaps, or html).

class HTMLString
  def html
  end
end

The 'responsibility' of the output is with HTMLString. Good. This way an
implementation of Bio::Alignment does not need to know about HTML,
but still can generate the output, at the user's request.

> # this is html under the responsibility of the programmer
>
> a.add_seq('ATGCATGC', '<b>')
> # this is not html; don't care on '<', or '>'
>
> simple = Bio::Html::HtmlAlignment.new(a,
>   :title => HTMLString.new('A <em>fancy</em> <b>HTML</b> <i>title</i>'))
> html = simple.html()
>
> If Bio::Alignment does not force the object given to be String,
> such code should be possible without the change in Bio::Alignment,
> and only the HtmlAlignment class and the programmer needs to know it.
> So, HTML specific code does not need go to regular BioRuby code.

HTMLAlignment should not care how the HTML is generated either. It is
really up to the container holding the sequence, or description, what
the output is.

What I don't like about the proposed approach is that HTMLAlignment gets
an object, needs to check for a 'to_html' or 'html' method (ugly), and
if it does not exist, needs to escape the information (by calling the
to_s method?). That is a lot of formal checking to do for every
output generated.

>> That would be the proper way to handle it. No testing of methods
>> (like to_html), but use the object structure to define what is
>> supported (and not).
>
> I'm not sure what do you mean by "use the object structure".
> How do you distinguish a plain text and HTML text?

The output is generated by an HTML-aware container. We can agree to
use a single method, 'html'.

Create different types of objects:

  HTMLSequence.html - generates formatted HTML
  ColorHTMLSequence.html - generates formatted color HTML
  EscapedHTMLSequence.html - generates escaped native stuff

And if someone wanted it, he could create:

  Sequence.html  - generates plain text

This would prevent downstream 'checking' of object responsibilities.
We can assume the user knows he is going to use HTMLAlignment and
therefore we can expect him to pass in a known HTML supported
Sequence object.

The reason to put the responsibility in the right place is to keep
the code as clean as possible. You really don't want downstream
checking of methods.
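
A minimal sketch of the idea above, under assumptions: the class names
follow the list above, but none of them exist in BioRuby, and the
container shown here is hypothetical. Every object handed to the HTML
container answers 'html', so the container never tests for methods.

```ruby
require 'cgi'

# Hypothetical wrapper for plain text (not part of BioRuby).
class EscapedHTMLSequence
  def initialize(text)
    @text = text
  end

  # Plain text: escape it so that '<' and '>' are shown literally.
  def html
    CGI.escapeHTML(@text.to_s)
  end
end

# Hypothetical wrapper for markup the programmer vouches for.
class HTMLSequence
  def initialize(markup)
    @markup = markup
  end

  # Already HTML: hand it over untouched.
  def html
    @markup
  end
end

# Hypothetical container: it only ever calls 'html' on its items.
class HTMLAlignment
  def initialize(items)
    @items = items
  end

  def html
    @items.map { |item| "<li>#{item.html}</li>" }.join("\n")
  end
end

rows = [EscapedHTMLSequence.new('<b>'), HTMLSequence.new('<em>a</em>')]
puts HTMLAlignment.new(rows).html
```

The choice between escaping and passing markup through is made once,
when the object is created, rather than on every output call.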

We can further discuss in Japan. At least it is clear we have several
options.

Pj.


> -- 
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
>
> On 2010/01/16, at 17:30, Pjotr Prins wrote:
>
>> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
>>>> I am going to add a 'master' switch for escaping of HTML. The  
>>>> default
>>>> will be with escaping.
>>>
>>> How do you think to test if the object responds to to_html
>>> and then call to_html else pass to escapeHTML.
>>
>> In this case the object to convert to HTML is a String and part of
>> Bio::Alignment. Later implementations of Bio::Alignment could use a
>> Bio::Sequence.id (or something Naohisa wrote me).  It would mean we
>> would have to create a Bio::Sequence::Descriptor object, which would
>> contain several specialistic 'output' generators.
>>
>> This is a recurrent idea we need to discuss.
>>
>> I think *all* HTML based stuff should be in its own objects - and its
>> own tree (I have created bio/output/html for that purpose).
>>
>> I think it is a bad idea to clutter regular BioRuby code with HTML
>> specific stuff. Likewise for other outputs, as you pointed out, like
>> plotting. Output should live in
>>
>>   bio/lib/output/html
>>   bio/lib/output/plot
>>   bio/lib/output/gtk
>>   bio/lib/output/rails (perhaps)
>>   (etc)
>>
>> that way display code never pollutes the simple Bio::Sequence object,
>> for example. You'll get Bio::Html::Sequence for that - or my
>> preferred naming Bio::HtmlSequence.
>>
>> Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the
>> latter would not care - and we could adapt the HtmlSequence info to
>> show embedded hrefs.
>>
>> That would be the proper way to handle it. No testing of methods
>> (like to_html), but use the object structure to define what is
>> supported (and not).
>>
>> Until we implement that (get Bio::Alignment to support arbitrary
>> Sequence objects) I think the master switch is fine. I have updated
>> my branch. Default behaviour is escaping. If a user (like me) wants
>> it otherwise, it is allowed.
>>
>> Pj.
>>
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From donttrustben at gmail.com  Mon Jan 18 21:15:30 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 19 Jan 2010 12:15:30 +1000
Subject: [BioRuby] SPTR problem
In-Reply-To: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
References: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
	<20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <bb2b67d01001181815h149b1539x2f9701670cca8ecc@mail.gmail.com>

Hi,

Thanks for the response. My replies are embedded below.

2010/1/16 Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp>

>
> It seems to be a bug. Perhaps there were no (or very few) entries
> which only had OrderedLocusNames= when the code was first written
> in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d.
>

I was figuring that. Also, since no actual exception was thrown, errors
might not have been noticed. I wrote a patch for this that I've been using
internally, but haven't included unit tests.
http://github.com/wwood/bioruby/commit/b2f6cb0b
Happy to write tests, but you seem to rewrite my patches anyway.


>
> The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004.
> The document http://www.uniprot.org/docs/sp_news.htm says:
> | The new format of the GN line is:
> |
> | GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
> | GN   ORFNames=<name1>[, <name2>...];
> |
> | None of the above four tokens are mandatory. But a "Synonyms" token can
> | only be present if there is a "Name" token.
>
> You are right the 4 possibilities should be considered.
> "Synonyms" can be eliminated, but it may be safe to be included.
>
> > Also, while I'm here:
> > * why does the returned hash have different keys than are in the file?
> > e.g. ORFNames becomes :orfs?
>
> I don't know. Now, I think using the same names as described
> in the original entries may be preferred, too.
>

What do you suggest we do about this?
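
One possible shape, as a sketch only and not the BioRuby
implementation: parse all four GN tokens and key the result by the
names used in the file. The helper name parse_gn, the constant
GN_TOKENS and the gene names in the example are made up for
illustration.

```ruby
# Token names exactly as they appear in the GN line format quoted above.
GN_TOKENS = %w[Name Synonyms OrderedLocusNames ORFNames].freeze

# Hypothetical helper: takes the GN lines of one gene and returns a hash
# keyed by token name. None of the tokens is required to be present.
def parse_gn(lines)
  # Strip the "GN   " line code and join continuation lines.
  body = lines.map { |line| line.sub(/\AGN\s+/, '').strip }.join(' ')
  result = {}
  body.split(';').each do |field|
    token, value = field.strip.split('=', 2)
    next unless GN_TOKENS.include?(token) && value
    names = value.split(',').map(&:strip)
    # "Name" holds a single value; the other tokens hold lists.
    result[token] = (token == 'Name' ? names.first : names)
  end
  result
end

# Made-up example values, not taken from a real entry.
gn = ['GN   Name=abcA; Synonyms=abc1, abc2; OrderedLocusNames=XX_0001;',
      'GN   ORFNames=ORF1, ORF2;']
p parse_gn(gn)

# An entry that only has OrderedLocusNames= still yields a result.
p parse_gn(['GN   OrderedLocusNames=XX_0001;'])
```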


>
> > * I also found the parsing process for whole genomes quite slow (multiple
> > hours for well annotated ones).
>
> Please use profiler to find bottlenecks.
>  % ruby -rprofile xxx.rb
>

I tried to do something like that but in the end found it easier to pre-grep
the uniprot file, keeping only the lines relevant to me. There were too many
levels of indirection in my code for me to bother tracking it down.


>
> > * is there any standard way to handle concatenated UniProt files? I wrote
> > my own as it was simple.
>
> What type of "concatenated" do you mean?
> For simple concatenation, for example, original file distributed
> from UniProt FTP site, Bio::FlatFile can be used.
>
> ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
> (please gunzip before reading!)
>
>  ff = Bio::FlatFile.open("uniprot_sprot.dat")
>  ff.each do |e|
>   puts e.entry_id
>  end
>

More evidence I'm an idiot. Like I needed any.
Thanks,
ben

From pjotr.public14 at thebird.nl  Tue Jan 19 05:50:56 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 11:50:56 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100117135441.GA24341@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
Message-ID: <20100119105056.GA29525@thebird.nl>

Based on Tomoaki's comments I propose the following:

The requirements are:

  A- input objects that know about HTML should generate that
  B- other input objects get escapeHTML(object.to_s)

For a container/displayer to recognize object A, object A should have
a method to_html:

  class ObjectA 
    def to_html
    end
  end

If to_html does not exist, to_s is called and the result is escaped. The principle
will go into a mixin for the container class.
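
A minimal sketch of such a mixin, assuming hypothetical names
(HtmlOutput, html_for and Displayer are illustrative only and not part
of BioRuby):

```ruby
require 'cgi'

# Hypothetical mixin for a container/displayer class.
module HtmlOutput
  # Requirement A: objects that know about HTML render themselves.
  # Requirement B: everything else is converted with to_s and escaped.
  def html_for(object)
    if object.respond_to?(:to_html)
      object.to_html
    else
      CGI.escapeHTML(object.to_s)
    end
  end
end

class ObjectA
  def to_html
    '<em>A</em>'
  end
end

# Hypothetical container that mixes the rule in.
class Displayer
  include HtmlOutput

  def render(objects)
    objects.map { |object| html_for(object) }.join(' ')
  end
end

puts Displayer.new.render([ObjectA.new, 'x < y', 12])
```

The respond_to? check lives in one place, so individual containers do
not repeat it.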

Everyone OK with this? 

Pj.

From ktym at hgc.jp  Tue Jan 19 07:41:31 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 19 Jan 2010 21:41:31 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119105056.GA29525@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
Message-ID: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>

Dear Pj and all,

I'm sorry that I could not spare enough time to follow this thread
but I'd like to add some comments.

Firstly, I don't like the method name 'to_html'. We already
deprecated 'to_fasta' because 'to_' is reserved for conversion
of the class in Ruby's convention (the above two methods just convert
String to String).

Nakao-san and I are now working to improve our TogoWS service
(http://togows.dbcls.jp) by supporting RDF output. I hope to propose
a generalized way to achieve this (hopefully, before the BioHackathon
2010 http://hackathon3.dbcls.jp/).

Our current attempt is to have an 'output' method in the Bio::DB class,
with each subclass implementing the actual 'output_*' methods for
the formats it supports.

# This kind of requirement may also apply to classes other than
# Bio::DB (for example, the Bio::Sequence, Alignment and Newick classes),
# so we may put this interface in a top-level class (Bio::Root?),
# which does not exist for now, though.

In TogoWS, we internally use the BioRuby library, and the URI

http://togows.dbcls.jp/entry/exampledb/1/definition

is sent to the 'definition' method defined in the Bio::ExampleDB class.
Similarly, we can map the '.' notation in the following URLs to a call to
the output method, using the suffix as a format specifier.

http://togows.dbcls.jp/entry/exampledb/1.rdf
http://togows.dbcls.jp/entry/exampledb/1.fasta

Therefore, these can be mapped to output(:rdf) and output(:fasta) method
calls to the Bio::ExampleDB class, respectively.
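
A rough sketch of the suffix mapping, under assumptions: the helper
format_from_path and the list KNOWN_FORMATS are hypothetical, and the
real TogoWS routing is not shown here.

```ruby
# Only formats on this list are accepted, so a request suffix can never
# select an arbitrary output_* method.
KNOWN_FORMATS = %w[rdf fasta].freeze

# Hypothetical helper: returns the format as a Symbol, or nil when the
# path has no known suffix.
def format_from_path(path)
  suffix = File.extname(path).delete('.').downcase
  KNOWN_FORMATS.include?(suffix) ? suffix.to_sym : nil
end

p format_from_path('/entry/exampledb/1.rdf')
p format_from_path('/entry/exampledb/1.fasta')
p format_from_path('/entry/exampledb/1/definition')
```

The resulting symbol is what would be handed to the output method;
a path without a known suffix yields nil and can be handled separately.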

All we need to do is add these methods to every database class.

I think this is simple enough and beautiful.
I'll attach some primitive pseudocode below.
Comments are welcome.

Regards,
Toshiaki Katayama


module Bio
  class DB
    def output(format)
      send("output_#{format.to_s.downcase}")
    end
  end
end

module Bio
  class ExampleDB < DB
    # output sequence of the entry in FASTA format
    def output_fasta
      ">#{@entry_id} #{@definition}\n#{@sequence}\n"
    end

    # output contents of the entry in RDF (N3) format
    def output_rdf
      prefix_subject   = "http://togows.dbcls.jp/entry/exampledb"
      prefix_predicate = "http://togows.dbcls.jp/ontology/exampledb"
      # literal values are quoted, matching the example output below
      "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#definition>\t\"#{@definition}\" .\n" +
      "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#sequence>\t\"#{@sequence}\" .\n"
    end

    # output contents of the entry in HTML format
    def output_html
      "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
    end
  end
end

entry = Bio::ExampleDB.new(str)

entry.output(:fasta)
# =>
# >ENTRY_ID DEFINITION
# atgcatgcatgcatgcatgc

entry.output(:rdf)
# =>
# <http://togows.dbcls.jp/entry/exampledb/ENTRY_ID>	<http://togows.dbcls.jp/ontology/exampledb#definition>	"DEFINITION" .
# <http://togows.dbcls.jp/entry/exampledb/ENTRY_ID>	<http://togows.dbcls.jp/ontology/exampledb#sequence>	"atgcatgcatgcatgcatgc" .


On 2010/01/19, at 19:50, Pjotr Prins wrote:

> Based on Tomoaki's comments I propose the following:
> 
> The requirements are:
> 
>  A- input objects that know about HTML should generate that
>  B- other input files get escapeHTML(object.to_s)
> 
> For a container/displayer to recognize object A, object A should have
> a method to_html:
> 
>  class ObjectA 
>    def to_html
>    end
>  end
> 
> If to_html does not exist to_s is called - and escaped. The principle
> will go into a mixin for the container class.
> 
> Everyone OK with this? 
> 
> Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Tue Jan 19 09:05:17 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 19 Jan 2010 23:05:17 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID: <F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>

Hi,

> Firstly, I don't like to use the method name 'to_html' as we already
> deprecated to use 'to_fasta' because 'to_' is reserved for conversion
> of the class in Ruby's convention (above two methods just convert
> String to String).

I think HTML and String should actually be different classes.
There are to_i and to_f for conversion between subclasses of Numeric,
yet these are not rejected just because the conversion is Numeric to Numeric.

a string "<a href=example.com> aaa</a>" in HTML is
"&lt;a href=example.com&gt; aaa&lt;/a&gt;" but
HTML "<a href=example.com> aaa</a>" in HTML is
"<a href=example.com> aaa</a>"

The return value of to_html should be a different class than String.

So, the point is
>     def output_html
>       "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
>     end

how to regulate the different behavior of @entry_id.
If the nature of entry_id is plain text, that should be escaped.
On the other hand sometimes the user may want to use html aware
object for whatever purpose (color, link, etc...).
When we want to mix them with data supplied
from outside, say user input into CGI, those data shall usually
be treated as plain text and suppress any interference with html.

#!/usr/local/bin/ruby
require 'bio'
require 'cgi'

class Bio::HTMLString < String
   def to_html
     self
   end
end
def Bio::generate_html(object)
   if object.respond_to?(:to_html)
     object.to_html
   else
     string = CGI.escapeHTML(object.to_s) #fall back to escaping
     Bio::HTMLString.new(string)
   end
end

p Bio::generate_html(12)
p Bio::generate_html(Bio::HTMLString.new('<a href=example.com> aaa</a>'))
p Bio::generate_html('<a href=example.com> aaa</a>')
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Tue Jan 19 09:34:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 15:34:22 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID: <20100119143422.GA1781@thebird.nl>

On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote:
> All we need to do is to add these methods in every database class
> comprehensively.
> 
> I think this is simple enough and beautiful.
> I'll attach a primitive pseudo code in below.
> Comments are welcome.

I agree with Tomoaki it is too restrictive. What, indeed, if we want
to present the HTML in a different way?

The second comment is that I dislike the way the current files like
sequence.rb and alignment.rb are mushrooming in size. There is much
too much in there, which discourages people from diving in. I believe
code should be readable, and easy to understand/digest.

Sticking in output 'details', like HTML generation, does not help.

I really would like all HTML to be in one sub-tree. Also XML, RDF and
whatnot. When it is 'business' logic it should be in database. When it
is output transformations it is not 'business' logic any longer.

Don't you think the Sequence, or KEGG, object should not care about
HTML? Or RDF, or plotting? Those are separate functionalities. They
share common access patterns - which are part of the DB class.

Finally, why not use method names? What is the added value of 

  output(:html)

over 

  output_html

Pj.

From ktym at hgc.jp  Tue Jan 19 10:33:30 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 00:33:30 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>
Message-ID: <FBDE48DC-0B56-4AFA-BC08-4F198FE48778@hgc.jp>

Nishiyama-san,

I couldn't catch what you are trying to do...
(maybe because I didn't read through the whole thread)

On 2010/01/19, at 23:05, Tomoaki NISHIYAMA wrote:

> Hi,
> 
>> Firstly, I don't like to use the method name 'to_html' as we already
>> deprecated to use 'to_fasta' because 'to_' is reserved for conversion
>> of the class in Ruby's convention (above two methods just convert
>> String to String).
> 
> I think HTML and String should be actually a different class.
> There are to_i and to_f for conversion between subclasses of Numeric,
> yet this isn't denied because the conversion is Numeric to Numeric.
> 
> a string "<a href=example.com> aaa</a>" in HTML is
> "&lt;a href=example.com&gt; aaa&lt;/a&gt;" but
> HTML "<a href=example.com> aaa</a>" in HTML is "<a href=example.com> aaa</a>"
> 
> The return value of to_html should be a different class than String.

If the method is named to_html, it might return an HTML object.

But, from my point of view, an HTML string is still just text,
and escaping the HTML string is the responsibility of the programmer,
depending on where the string will be used.


> 
> So, the point is
>>    def output_html
>>      "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
>>    end
> 
> how to regulate the different behavior of @entry_id.
> If the nature of entry_id is plain text, that should be escaped.
> On the other hand sometimes the user may want to use html aware
> object for whatever purpose (color, link, etc...).
> When we want to mix them with data supplied
> from outside, say user input into CGI, those data shall usually
> be treated as plain text and suppress any interference with html.

I'm talking about a database class, and the content of
@entry_id is a string parsed from a flat file entry of
that database (it does not come from outside).


> 
> #!/usr/local/bin/ruby
> require 'bio'
> require 'cgi'
> 
> class Bio::HTMLString < String
>  def to_html
>    self
>  end
> end
> def Bio::generate_html(object)
>  if object.respond_to?(:to_html)
>    object.to_html
>  else
>    string = CGI.escapeHTML(object.to_s) #fall back to escaping
>    Bio::HTMLString.new(string)
>  end
> end
> 
> p Bio::generate_html(12)
> p Bio::generate_html(Bio::HTMLString.new('<a href=example.com> aaa</a>'))
> p Bio::generate_html('<a href=example.com> aaa</a>')

Why do we need to have this functionality under the Bio name space?

Toshiaki

> -- 
> Tomoaki NISHIYAMA
> 
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
> 


From ktym at hgc.jp  Tue Jan 19 11:21:54 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 01:21:54 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119143422.GA1781@thebird.nl>
References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
Message-ID: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>

Dear Pj,

On 2010/01/19, at 23:34, Pjotr Prins wrote:

> On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote:
>> All we need to do is to add these methods in every database class
>> comprehensively.
>> 
>> I think this is simple enough and beautiful.
>> I'll attach a primitive pseudo code in below.
>> Comments are welcome.
> 
> I agree with Tomoaki it is too restrictive. What, indeed, if we want
> to present the HTML in a different way?

Hmm. Could you provide me some use cases?

Override the output_html method, or, use some template engine to be
more generic.


> 
> The second comment is that I dislike the way the current files like
> sequence.rb and alignment.rb are mushrooming in size. There is much
> too much in there, which discourages people from diving in. I believe
> code should be readable, and easy to understand/digest.

I agree that some files have become too large to learn and/or maintain.
But if we try to change the structure of the current code base,
we need to define clear criteria beforehand.

If we split files into sub-files, people then need to look through
a larger number of files, and it may also slow down the loading of
the bioruby library. It is a matter of balance.

In both cases, the lack of a good guide for reading through the bioruby
library might be the essential issue.


> 
> Sticking in output 'details', like HTML generation, does not help.
> 
> I really would like all HTML to be in one sub-tree. Also XML, RDF and
> whatnot. When it is 'business' logic it should be in database. When it
> is output transformations it is not 'business' logic any longer.

I'm not sure about HTML but FASTA and RDF, for example, are tightly
related to the original database format/contents. So, I proposed
to have methods to generate formatted string in each database class.

There can be many ways to design OO class trees and to find the best
way to represent/abstract things is always a difficult task.

At some point, we may do a refactoring to produce BioRuby 2.0.
Before doing that, we can discuss how to arrange all the classes and code cleanly.
We may need someone who understands the entire structure and contents of
the current codebase and is willing to design a better one with good sense.


> 
> Don't you think the Sequence, or KEGG, object should not care about
> HTML? Or RDF, or plotting? Those are separate functionalities. They
> share common access patterns - which are part of the DB class.

Again, we can take either approach. My current proposal is the conservative one:
just add these functionalities to each class, as the class knows what is in it
and what is the best way to represent the contents.

If we move the formatting/plotting functionality into a separate class,
it might be something like the Bio::FlatFile class, which knows the header
line format of every database entry. Or we may design a better one.

Anyway, I'm listening now. So, please don't stick with HTML things only,
and think about a global design to which we can plan to migrate.


> 
> Finally, why not use method names? What is the added value of 
> 
>  output(:html)
> 
> over 
> 
>  output_html
> 
> Pj.

Maybe from an aesthetic viewpoint?

I think it looks better, and we can easily switch the output format
depending on the context without modifying the code.
I have something like the @media rule in CSS (screen, print etc.) in mind.

if used_for_semantic_web?
  format = :rdf
  # add some code to do the preparation for the Semantic Web
elsif used_for_blast?
  format = :fasta
  # add some code to do the preparation for blast
end

# we don't need to change the following line in any context
entry.output(format)

Toshiaki


From pjotr.public14 at thebird.nl  Tue Jan 19 15:52:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 21:52:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100119205241.GA7043@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I agree with Tomoaki it is too restrictive. What, indeed, if we want
> > to present the HTML in a different way?
> 
> Hmm. Could you provide me some use cases?

Think of URLs. One user wants to point a gene ID to NCBI. Another
to Swissprot. The container can not be aware of all exceptions - and
really should not handle it.

> Override the output_html method, or, use some template engine to be
> more generic.

Maybe those are good mechanisms. In the pre-hackathon we should
discuss these points.

> I can agree some files became too large to learn and/or maintain.
> But if we try to change the structure of current code base,
> we need to define a clean criteria beforehand.

Yes.

> If we separate files into sub files, people then need to look around
> the number of files, and it may also slow down the loading speed of
> the bioruby library. It is a problem of balance.
> 
> In both cases, lack of excellent guide to read through the bioruby
> library might be a essential issue.

I think if we structure the files and modules well - and make them
small enough - they become self-explaining. That would be my ultimate
goal.

> At some time, we may do refactoring to produce BioRuby 2.0.
> Before doing that, we can discuss how to sit all classes/codes cleanly.
> We may need someone who understand entire structure/contents of
> the current codebase and willing to design a better one with a good sense.

Yes. I agree it is a big step. But we should go for this type of
challenge.

> > Don't you think the Sequence, or KEGG, object should not care about
> > HTML? Or RDF, or plotting? Those are separate functionalities. They
> > share common access patterns - which are part of the DB class.
> 
> Again, we can take both approach. My current proposal is conservative one.
> Just add these functionalities in each class as the class knows what is in it
> and what is the best way to represent the contents.
> 
> If we separate formatting/plotting functionalities into separate class,
> which might be something like Bio::FlatFile class who knows the header
> line format of every database entries. Or we may design better one.

FlatFile has some downsides. It has complicated the libraries.
Complication means the modules are less easy to adapt/modify. I think
it is slightly over-engineered. Maybe not enough of a problem to take
it out, but I hope you see where I am coming from.

> Anyway, I'm now listening. So, please don't stick with HTML things only
> and think a global design to which we can plan to migrate.

I have to spend a day on a writeup in the coming two weeks. I will
try to explain my ideas.

> Maybe from esthetics viewpoint?
> 
> I think it looks better, and, we can easily switch the output format
> depending on the context without modifying the code.
> Something like a @media property in CSS (screen, print etc.) in mind.
> 
> if used_for_semantic_web?
>   format = :rdf
>   # add some codes to do preparation job for SW
> elsif used_for_blast?
>   format = :fasta
>   # add some codes to do preparation job for blast
> end
> 
> # we don't need to change the following line in any context
> entry.output(format)

I see your point. The criticism is that it obfuscates the real
intention of the code - i.e. it is not self documenting any longer.
But, I guess, this boils down to preferences and acquired tastes. It
is not obvious to a newbie, though it may be obvious for someone who
is accustomed to Bioruby internals. Which may be good - depending on
our basic values.

Pj.

From ktym at hgc.jp  Tue Jan 19 19:49:37 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 09:49:37 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119205241.GA7043@thebird.nl>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
	<20100119205241.GA7043@thebird.nl>
Message-ID: <DB581F4F-C57E-4007-B3B9-6BFB89BC20CE@hgc.jp>

Dear Pj,

On 2010/01/20, at 5:52, Pjotr Prins wrote:

> Dear Toshiaki,
> 
> On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
>>> I agree with Tomoaki it is too restrictive. What, indeed, if we want
>>> to present the HTML in a different way?
>> 
>> Hmm. Could you provide me some use cases?
> 
> Think of URL's. One user wants to point a gene ID to NCBI. Another
> to Swissprot. The container can not be aware of all exceptions - and
> really should not handle it.

Still not clear to me.

I had assumed we would generate a URL string for the href attribute of <a>.
However, are there any IDs which need to be escaped?
Or do you mean embedding an HTML snippet in a URL?
If so, we may need to use URL encoding (URI.escape)
instead of HTML escaping (CGI.escapeHTML).
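
To make the difference between the two kinds of escaping concrete, here is a minimal sketch. The gene ID and the example.org URL are made up for illustration. It uses URI.encode_www_form_component rather than URI.escape, on the assumption of a newer Ruby (URI.escape was later removed in Ruby 3.0).

```ruby
require 'cgi'
require 'uri'

# Made-up ID containing characters that need escaping.
gene_id = 'b0001 & "thrL"'

# URL encoding: makes the ID safe inside a query string.
query = URI.encode_www_form_component(gene_id)
url   = "http://example.org/entry?id=#{query}"   # placeholder URL

# HTML escaping: makes the URL and the visible label safe inside markup.
link = %(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(gene_id)}</a>)
puts link
```

The two steps are not interchangeable: the ID is URL-encoded once when the URL is built, and the finished URL and label are HTML-escaped when they are placed in the page.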


> 
>> Override the output_html method, or, use some template engine to be
>> more generic.
> 
> Maybe those are good mechanisms. In the pre-hackathon we should
> discuss these points.

Is there any better replacement for Ruby's CGI library available?

Requirements:

- separation of the HTML from CGI

CGI.escapeHTML looks ugly in terms of the naming convention (CamelCase)
and the name space -- why not HTML.escape(string). Moreover, we don't
want to require 'cgi' just for escaping a HTML string.

- support for templates (separation of logic and presentation)

I have used erb and html-template. Sometimes erb is too slow (especially
when it contains a nested loop to generate a number of lists or tables).

- bundled with Ruby as a standard library

Otherwise, we had better use Rails as the default environment
(from the viewpoint of popularity).


> 
>> I can agree some files became too large to learn and/or maintain.
>> But if we try to change the structure of current code base,
>> we need to define a clean criteria beforehand.
> 
> Yes.
> 
>> If we separate files into sub files, people then need to look around
>> the number of files, and it may also slow down the loading speed of
>> the bioruby library. It is a problem of balance.
>> 
>> In both cases, lack of excellent guide to read through the bioruby
>> library might be a essential issue.
> 
> I think if we structure the files and modules well - and make them
> small enough - they become self-explaining. That would be my ultimate
> goal.
> 
>> At some time, we may do refactoring to produce BioRuby 2.0.
>> Before doing that, we can discuss how to sit all classes/codes cleanly.
>> We may need someone who understand entire structure/contents of
>> the current codebase and willing to design a better one with a good sense.
> 
> Yes. I agree it is a big step. But we should go for this type of
> challenge.
> 
>>> Don't you think the Sequence, or KEGG, object should not care about
>>> HTML? Or RDF, or plotting? Those are separate functionalities. They
>>> share common access patterns - which are part of the DB class.
>> 
>> Again, we can take both approach. My current proposal is conservative one.
>> Just add these functionalities in each class as the class knows what is in it
>> and what is the best way to represent the contents.
>> 
>> If we separate formatting/plotting functionalities into separate class,
>> which might be something like Bio::FlatFile class who knows the header
>> line format of every database entries. Or we may design better one.
> 
> FlatFile has some downsides. It has complicated the libraries.
> Complication means the modules are less easy to adapt/modify. I think
> it is slightly over-engineered. Maybe not enough of a problem to take
> it out, but I hope you see where I am coming from.
> 
>> Anyway, I'm now listening. So, please don't stick with HTML things only
>> and think a global design to which we can plan to migrate.
> 
> I have to spend a day on a writeup. In the coming two weeks. I will
> try to explain my ideas.


OK, let's discuss these topics as well during the pre-hackathon
meeting (7th Feb) in Tokyo with other core developers.


> 
>> Maybe from esthetics viewpoint?
>> 
>> I think it looks better, and, we can easily switch the output format
>> depending on the context without modifying the code.
>> Something like a @media property in CSS (screen, print etc.) in mind.
>> 
>> if used_for_semantic_web?
>>  format = :rdf
>>  # add some codes to do preparation job for SW
>> elsif used_for_blast?
>>  format = :fasta
>>  # add some codes to do preparation job for blast
>> end
>> 
>> # we don't need to change the following line in any context
>> entry.output(format)
> 
> I see your point. The criticism is that it obfuscates the real
> intention of the code - i.e. it is not self documenting any longer.
> But, I guess, this boils down to preferences and acquired tastes. It
> is not obvious to a newbie, though it may be obvious for someone who
> is accustomed to Bioruby internals. Which may be good - depending on
> our basic values.
> 
> Pj.


Note that you can still use the output_html method in each database
class directly. The output(format) method is provided just as an abstract
interface, which will be useful in the above situation, for example.

Therefore, both of the following cases should return the same result, and
you can choose the coding style depending on the situation.

# case 1
format = :rdf
entry.output(format)

# case 2
entry.output_rdf

You can also check entry.respond_to?(:output_rdf) in both cases.
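
As a rough sketch of how such an abstract interface could forward to the per-format methods (ExampleEntry and its methods are hypothetical and are not BioRuby's actual implementation):

```ruby
# Hypothetical entry class, for illustration only.
class ExampleEntry
  def initialize(id)
    @id = id
  end

  def output_rdf
    "<rdf:Description rdf:about=\"#{@id}\"/>"
  end

  def output_fasta
    ">#{@id}\n"
  end

  # Abstract interface: forward to output_<format> when it exists.
  def output(format)
    method_name = "output_#{format}"
    unless respond_to?(method_name)
      raise ArgumentError, "unsupported format: #{format.inspect}"
    end
    public_send(method_name)
  end
end

entry = ExampleEntry.new('X00001')
same = (entry.output(:rdf) == entry.output_rdf)
puts same
```

With this layout, supporting a new format only means adding another output_<format> method; the caller's entry.output(format) line does not change.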

Toshiaki


From pjotr.public14 at thebird.nl  Wed Jan 20 02:36:44 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 20 Jan 2010 08:36:44 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100120073644.GA11295@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I really would like all HTML to be in one sub-tree. Also XML, RDF and
> > whatnot. When it is 'business' logic it should be in database. When it
> > is output transformations it is not 'business' logic any longer.
> 
> I'm not sure about HTML but FASTA and RDF, for example, are tightly
> related to the original database format/contents. So, I proposed
> to have methods to generate formatted string in each database class.
> 
> There can be many ways to design OO class trees and to find the best
> way to represent/abstract things is always a difficult task.

I wrote a nice alignment HTML output generator. Which also displays PAML
output. Currently it is in bio/output/html/htmlalignment.rb and the
class is named Bio::Html::Alignment. 

For the current Bioruby, where do you want to put that? I don't feel
it should be cluttering alignment.rb. Naohisa has suggested
bio/alignment/output/html/alignment.rb instead. I feel uncomfortable
with this. But it is kinda consistent with above, tightly relating it
to the alignment object.

What do you think of the class name?

The code is in my color-alignment branch, see

  http://github.com/pjotrp/bioruby/tree/color-alignment

Is anyone else interested in this type of discussion? We can take it
off-list.

Pj.

From missy at be.to  Wed Jan 20 04:17:50 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Wed, 20 Jan 2010 18:17:50 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat)
Message-ID: <4B56CA3E.8000905@be.to>

Hi all,

I am using BioRuby 1.4.0 and am having trouble handling the FASTA.QUAL
format using Bio::FastaNumericFormat.

Please see the following code:
========================
require 'rubygems'
require 'bio'

FASTA_QUAL =<<'EOS'
>SAMPLE1
30 30 29 42
EOS

qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
bs = qual.to_biosequence
puts bs.output(:raw)
=========================

The last line raises an error:

=========================
(eval):2:in `__get__seq': undefined method `seq' for #<Bio::FastaNumericFormat:0x2b182810ceb0> (NoMethodError)
         from (eval):4:in `seq'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in `output'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in `output'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in `output'
         from fasta_numeric_format.rb:11
=========================

Using :fasta, :fasta_numeric etc. in the last line gives the same result.

Please let me know if you have ideas to solve this problem.

Hiro.
-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From andrew.j.grimm at gmail.com  Wed Jan 20 07:09:19 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 20 Jan 2010 23:09:19 +1100
Subject: [BioRuby] Thread-safety of alignment
Message-ID: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>

Is alignment intended to be thread-safe in bioruby? If so, should I
use the same alignment factory between threads, or a separate one in
each thread?

Andrew

From ngoto at gen-info.osaka-u.ac.jp  Wed Jan 20 08:36:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:36:29 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format
 (Bio::FastaNumericFormat)
In-Reply-To: <4B56CA3E.8000905@be.to>
References: <4B56CA3E.8000905@be.to>
Message-ID: <20100120133630.052BF1CBC433@idnmail.gen-info.osaka-u.ac.jp>

Hi,

This is a bug, and will be fixed.
Indeed, Bio::FastaNumericFormat does not contain a sequence,
and I forgot to handle calls to to_biosequence.

As a workaround,

  qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
  bs = Bio::Sequence.new('')
  bs.quality_scores = qual.data
  puts bs.output(:fasta_numeric)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


From ngoto at gen-info.osaka-u.ac.jp  Wed Jan 20 08:50:45 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:50:45 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
Message-ID: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Wed, 20 Jan 2010 23:09:19 +1100
Andrew Grimm <andrew.j.grimm at gmail.com> wrote:

> Is alignment intended to be thread-safe in bioruby? If so, should I
> use the same alignment factory between threads, or a separate one in
> each thread?

It is not confirmed to be thread-safe, so it is safer to use a
separate one in each thread.

Currently, in BioRuby, manipulating the same object from different
threads is not an intended use. When manipulating the same object from
different threads is needed, using a mutex is recommended.

For library developers, it is encouraged to write thread-safe
code if possible, but not mandatory.
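
A minimal sketch of the mutex approach follows. The shared array and the per-thread work are placeholders and do not use any BioRuby alignment API.

```ruby
require 'thread'  # provides Mutex on Ruby 1.8; harmless on later versions

# Stand-in for any shared object that is not known to be thread-safe.
shared_results = []
lock = Mutex.new

threads = (1..4).map do |i|
  Thread.new do
    result = "alignment #{i}"   # placeholder for per-thread work
    # Only one thread at a time may touch the shared object.
    lock.synchronize { shared_results << result }
  end
end
threads.each { |t| t.join }

puts shared_results.size
```

Each thread does its own work independently and takes the lock only for the short moment it touches the shared object.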

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



From ktym at hgc.jp  Thu Jan 21 09:05:42 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Thu, 21 Jan 2010 23:05:42 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100120073644.GA11295@thebird.nl>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
	<20100120073644.GA11295@thebird.nl>
Message-ID: <7B739736-1D0D-43E2-89E8-8F6B4DCC3404@hgc.jp>

Dear Pj,

I looked at your code and had a feeling that we should use some template system.
If HTML tags are hard-coded in the library as you did, it will be very hard for users to modify them.

Besides, what version of the HTML specification did you have in mind?
This is the first time I have seen the <p> tag used in the form <p />. Is it valid?
I also think decoration should be separated into the CSS layer and you should avoid using the <font> tag, especially when you are trying to distribute your code as part of the library.
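
As a rough illustration of moving decoration into CSS, the sketch below marks each residue with a class name instead of a <font> tag. The class names and the colorize helper are hypothetical and are not taken from the color-alignment branch.

```ruby
require 'cgi'

# Hypothetical mapping from residue to CSS class name.
RESIDUE_CLASS = {
  'A' => 'res-a', 'C' => 'res-c', 'G' => 'res-g', 'T' => 'res-t'
}

# Wrap every residue in a <span> carrying only a class name;
# the actual colors would be defined in a stylesheet by the user.
def colorize(sequence)
  sequence.each_char.map do |residue|
    css = RESIDUE_CLASS.fetch(residue.upcase, 'res-other')
    %(<span class="#{css}">#{CGI.escapeHTML(residue)}</span>)
  end.join
end

puts colorize('AC-')
```

A user who wants different colors then edits a few CSS rules such as .res-a { color: red; } rather than the Ruby code.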


As for the file location, I still like the way Naohisa has suggested,
although I'm not sure the internal node 'output/html' is necessary in 'bio/alignment/output/html/alignment.rb'.
Anyway, we need to try every approach to learn the pros and cons.

With your proposal, we may have a tree like this:

--------------------------------------------------
for bio/alignment.rb and bio/db/kegg/compound.rb and bio/db/genbank.rb ...

bio/output/html/html_alignment.rb (Bio::Html::Alignment)
bio/output/html/html_kegg_compound.rb (Bio::Html::KEGG::COMPOUND)
bio/output/html/html_genbank.rb  (Bio::Html::GenBank)
 :

bio/output/rdf/rdf_kegg_compound.rb (Bio::RDF::KEGG::COMPOUND)
bio/output/rdf/rdf_genbank.rb (Bio::RDF::GenBank)
 :

bio/output/fasta/fasta_genbank.rb (Bio::FASTA::GenBank)
bio/output/fasta/fasta_kegg_genes.rb (Bio::FASTA::KEGG::GENES)
 :

bio/output/gff/gff_genbank.rb (Bio::GFF::GenBank)
 :
--------------------------------------------------

Apparently, the class names for output formats conflict with existing classes (e.g. Bio::FASTA, Bio::GFF), and we need to look into each subdirectory to find which output formats are supported for a particular database.


Alternatively, we could gather templates of output formats along with the database classes:

--------------------------------------------------
for bio/alignment.rb:
bio/alignment/alignment.html.erb
 :

for bio/db/kegg/compound.rb:
bio/db/kegg/compound/compound.rdf.erb
bio/db/kegg/compound/compound.tut.erb
bio/db/kegg/compound/compound.html.erb
 :

for bio/db/genbank.rb:
bio/db/genbank/genbank.rdf.erb
bio/db/genbank/genbank.gff.erb
bio/db/genbank/genbank.html.erb
bio/db/genbank/genbank.fasta.erb
 :
--------------------------------------------------

However, this is still only a plan on paper and we need to try more (we have already started for RDF).
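
A minimal sketch of the template idea is given below. ExampleCompound and render_html are hypothetical names, and the template text is inlined here only to keep the example self-contained; in the layout above it would live in a file such as bio/db/kegg/compound/compound.html.erb.

```ruby
require 'erb'

# Hypothetical stand-in for a database entry class.
ExampleCompound = Struct.new(:entry_id, :name)

# Template text; users could replace it without touching the class.
COMPOUND_HTML = <<'EOS'
<dl>
  <dt><%= ERB::Util.html_escape(entry.entry_id) %></dt>
  <dd><%= ERB::Util.html_escape(entry.name) %></dd>
</dl>
EOS

# Render the entry through the template; 'entry' is visible to the
# template because the method's binding is passed to ERB.
def render_html(entry)
  ERB.new(COMPOUND_HTML).result(binding)
end

puts render_html(ExampleCompound.new('C00001', 'H2O & water'))
```

The class only supplies the data and the template only describes the presentation, so a different look needs a different .erb file and no code change.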

Toshiaki




From pjotr.public14 at thebird.nl  Thu Jan 21 11:20:49 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 21 Jan 2010 17:20:49 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
Message-ID: <20100121162049.GB31462@thebird.nl>

Dear Toshiaki,

On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote:
> I looked your code and had a feeling that we should use some
> template system.  If HTML tags are hard coded in the library as you
> did, it will be very hard to modify them by the user.

Aren't we trying to overcomplicate things? This is an HTML generator
- in fact it is embedded HTML as I don't provide the <html>, header or
body parts. It can just be inserted into Rails, or whatever HTML
framework that is out there.

Templating is just another abstraction. I don't intend to replicate
template engines like the one in Rails.

Or, are you here merely referring to using the CGI class (or something
like that).  I guess I could do that, though I have trouble seeing the
benefits. It is just another way of writing HTML statements.

> Besides, what version of the HTML specification did you have in
> mind?
> This is my first time to see the <p> tag is used in the form of <p />. Is it valid?

Yes. It is, in fact, XHTML.

> I also think decorations should be separated to the CSS layer and you should avoid to use the <font> tag, especially when you are trying to distribute your code as a part of the library.

We use hard coded colors. I could use CSS, but then you need to
provide a CSS file (or I need to hard code the header of the file).
That makes it (again) more complicated than necessary. Where do we
store the CSS file, how do we make sure the browser finds it? CSS is
really to adapt look and feel. If the output is meant to be fixed, why
make it flexible?  Besides all (future) browsers support the font tag,
as used. If that stops we could always adapt that source code.

> As for the file location, I still like the way Naohisa has
> suggested.

Alright. I can move the files, if that was all.

However, my colored alignment is not going to make it into Bioruby
this way. There is always something wrong with my code, it appears.
Now I need to move file locations that have not really been decided
on; I need to template HTML - but we haven't decided how and it is
questionable; I need to use CSS, though I think it makes things worse
for users.

Are we really sure you want to reject this code just because it does
not live up to everyone's current and future expectations? It may
still be useful to someone else, you know, it does not break anything
else, and can be improved in the future. Once we decide what we want
to achieve.

The same really holds for my PAML branch and my GEO branch. Both
contain useful utilities for others to use. And now the alignment is
the third pending Bioruby branch.

Can you imagine my growing frustration? Should this go into Bioruby,
or should I start another project, like others have done? Or stick it
into my existing biotools or bigbio projects? Just, so I don't have
the hassle?

The way the Perl people handle it is by having independent modules.
Everyone owns his, or her, own module and Perl's CPAN acts more as an
aggregator. The advantage is that the environment is more dynamic. And
you really don't care what is inside a module. That is up to the
maintainer and his/her users.

We could create independent BioRuby modules, which have their own git
repositories. When a module is nice enough to include in Bioruby make
it a git submodule - I use this technique for biolib - it will
register in the BioRuby repository. That way Bioruby still controls
what goes in a release. However, modules can be maintained for
experimental setups or private use. So my modules would go in

  lib/bio/modules/paml
  lib/bio/modules/geo
  lib/bio/modules/htmlalignment

each its own git repository.

When one of those is 'strong' enough for main line you move it into a
different location in the main repository. Modules could even be
included in Bioruby releases.

What hurts me now is that no one is going to use my code, since I
don't have the time to make it perfect, and it is hidden in my
experimental Bioruby branches. We should find a way to make
'experimental code' available to the rest of the community. That way
we may also 'recruit' help to make the code more perfect. 

Make it easy to allow external modules to become visible through
Bioruby - that is a win-win, as well as a more bazaar-like approach
to OSS development.

I wonder how many people on this list would contribute code if it was
more loosely organised.

Pj.

From ktym at hgc.jp  Thu Jan 21 12:54:24 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Fri, 22 Jan 2010 02:54:24 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100121162049.GB31462@thebird.nl>
References: <20100121162049.GB31462@thebird.nl>
Message-ID: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>

Dear Pj,

I can understand your frustration and I like your idea of the
'module' system, as it reminds me of the way the Linux kernel
tree is successfully maintained.

> I wonder how many people on this list would contribute code if it was
> more loosely organised.

Indeed.

However, I think our move from cvs to git was already a great step,
as it opened up a large opportunity for all those who want to participate
in development. Before that, an "open source" project did not always
mean an "open to join" project.

Now everyone can easily fork the project and release their modified
code, as you have already done. So we may be able to judge from the
current situation how many other people have tried.

Anyway, it is still a difficult problem who decides, and how, when
contributed code should migrate into the main tree.
It might sound like an excuse, but I'm also suffering from this difficulty.
I also have several modules which are not yet contributed to the main tree.
One example is my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/sge/),
because I'm not sure it is general enough or where it fits.


As for the HTML portion, I see your point.

* I'd like to hear comments from others.
* How do people like to render/visualize BioRuby objects (especially in HTML)?
* I didn't mean to use the CGI class for HTML generation (I don't even like that).
* The use of <p /> seems invalid in XHTML. See http://www.w3.org/TR/xhtml1/#C_3


P.S.
I once developed a mechanism, called plugins, to integrate end-user code
snippets into the BioRuby shell. I wrote some plugins which render
a colored codon table, a formatted summary of sequence properties etc.

If those, and the functions defined in your plugins, could be easily accessed by

  puts Bio.your_function_name(options)

or something like that, would that satisfy your needs?

If so, we can consider making a repository for such plugins and bundling
them with BioRuby as well.

Regards,
Toshiaki Katayama




From yannick.wurm at unil.ch  Thu Jan 21 13:21:40 2010
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 21 Jan 2010 19:21:40 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
References: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
Message-ID: <EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>

On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote:

> Are we really sure you want to reject this code just because it does
> not live up to everyone's current and future expectations? It may
> still be useful to someone else, you know, it does not break anything
> else, and can be improved in the future. Once we decide what we want
> to achieve.

> 
> What hurts me now is that no one is going to use my code, since I
> don't have the time to make it perfect, and it is hidden in my
> experimental Bioruby branches. We should find a way to make
> 'experimental code' available to the rest of the community. That way
> we may also 'recruit' help to make the code more perfect. 


I agree 100% that enthusiastic bioruby improvements like Pjotr's should be encouraged & given maximal visibility.
It's better to have great tools with room for improvement than no tools. 
(a year or two ago I needed colored html alignments and ended up with an ugly, ugly hack that used t_coffee to generate html output from the alignments I'd generated elsewhere - something like Pjotr's code would have been much more elegant)

I also have the feeling that code contributions in general are given more negative than positive feedback on this list. I believe it's a grave mistake because the bioruby community will not grow without passionate users & contributors and more quality code.

just my two cents,

yannick

--------------------------------------------
          yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
   http://www.unil.ch/dee/page28685_fr.html


From pjotr.public14 at thebird.nl  Fri Jan 22 03:55:08 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 09:55:08 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
Message-ID: <20100122085508.GB12248@thebird.nl>

On Fri, Jan 22, 2010 at 02:54:24AM +0900, Toshiaki Katayama wrote:
> Dear Pj,
> 
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.

Thinking about it there are other good examples. The R language
supports modules in CRAN - similar in many ways to generic Perl CPAN
and Ruby's gems. But, on top of CRAN they also have Bioconductor which
aggregates Bio related modules. The main benefit is that it
pre-packages all Bio related packages and people can load it on the
fly. See http://www.bioconductor.org/

We don't want to replace gems - but I think the gem system is too
loose for most people, and it requires every module to understand and
comply with the gem system.

I think Bioruby can play a role here. We can have modules (or
plugins, like Rails has) that come either with Bioruby's
installation, or get installed on request. If we find a syntax for
that it would be great. E.g.

  Bio::Module.load(:html_alignment)

If it is part of Bioruby, pass. Otherwise throw error: 

"Bio::Module :html_alignment not installed, try Bio::Module.install(:html_alignment)"

  Bio::Module.install(:html_alignment)

will search for the module definition and install it. Depending on the
module it can be installed as a gem, or fetched through git or a tarball
(an optional parameter can override this behaviour). On success, either
function leaves everything ready for:

  html_aln = Bio::Html::Alignment.new('my.aln')

The nice thing about this setup is that

(1) It is really easy on the user

(2) Decouples the module from Bioruby - all issues are between the
users and the module maintainer - discussions can still be on the
main mailing list

(3) Retains some control on what modules are allowed in, and what not

(4) Modules can be obsoleted

(5) Modules can be updated outside Bioruby's mainline. e.g. Bio::Module.install(:html_alignment,:development=>true)
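
A rough sketch of how the load side could look, just to make the idea
concrete. Nothing below exists in Bioruby: the registry hash, the
library path in it and the NotInstalledError class are all assumptions
made up for illustration, and install is left out entirely.

```ruby
# Hypothetical sketch only: Bio::Module is the proposed interface,
# not an existing part of Bioruby.
module Bio
  module Module
    class NotInstalledError < StandardError; end

    # module name => library path passed to require (made-up path)
    REGISTRY = {
      :html_alignment => 'bio/modules/html_alignment'
    }

    # Load a registered module, or explain how to get it.
    def self.load(name)
      path = REGISTRY[name]
      raise ArgumentError, "unknown Bio::Module #{name.inspect}" if path.nil?
      begin
        require path
      rescue LoadError
        raise NotInstalledError,
              "Bio::Module #{name.inspect} not installed, " \
              "try Bio::Module.install(#{name.inspect})"
      end
      true
    end
  end
end
```

With this, Bio::Module.load(:html_alignment) either succeeds quietly or
raises with the message quoted above.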

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Fri Jan 22 04:12:29 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:12:29 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
Message-ID: <066BB141-7217-4343-85B4-165072A58E06@kenroku.kanazawa-u.ac.jp>

Hi,

> As for the HTML portion, I see your point.
>
> * I'd like to hear comments from others.
> * How people like to render/visualize the BioRuby objects  
> (especially in HTML)?
> * I didn't mean to use the CGI class for HTML generation (I even  
> don't like that).


Perhaps the way to render the objects depends on both the objects and
the purposes, but if the object has a string representation, just
showing it is perhaps a good default. Also, defining how to represent
all classes comprehensively in HTML or any other format is too
laborious as a first step, and a way to allow gradual growth of the
codebase seems good. That is how the flatfile parser grew to support
many formats.

Thus, a mechanism that does class-specific conversion, with a default
conversion for non-HTML-aware classes, is good.
Criticism of the 'cgi' library for the default
conversion CGI.escapeHTML(object.to_s), especially of the name,
is understandable.
There is already criticism of CGI.rb itself
<http://jp.rubyist.net/magazine/?0023-Cgirb>
but there are no *standard* alternatives yet.
Perhaps we can just copy or rewrite the escapeHTML code
and give it any name that fits our purpose.

A drawback of having our escapeHTML code is that it could be
redundant in many cases where html generation is for CGI, and
we cannot get benefit from CGIAlt or any other compatible speedup
library on CGI, rewrite or extension with C. But I think this is
not a very large problem.
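
As a minimal sketch of what "our own escapeHTML" might look like
(the Bio::HTMLEscape name is only an assumption for illustration, and
I assume that escaping the four characters & < > " is sufficient):

```ruby
# Hypothetical helper; not part of BioRuby.
module Bio
  module HTMLEscape
    TABLE = {
      '&' => '&amp;',
      '<' => '&lt;',
      '>' => '&gt;',
      '"' => '&quot;'
    }

    # Default conversion for classes that are not HTML aware:
    # take the string representation and escape it.
    def self.escape(object)
      object.to_s.gsub(/[&<>"]/) { |c| TABLE[c] }
    end
  end
end

puts Bio::HTMLEscape.escape('<seq id="x">A&B</seq>')
```

This does not need cgi.rb at all, at the cost of the redundancy
mentioned above.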

Making require 'bio' automatically load cgi.rb is undesirable.
If the html code is not loaded automatically by require 'bio'
but only by a separate require 'bio/html', then
I feel 'bio/html' loading cgi.rb is within a reasonable range.

The ability to use styles instead of directly specifying color and
font is desirable since it could reduce the output size and
possibly improve readability.
Nonetheless, this is not mandatory, and a first implementation
with direct specification is OK.

-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2010/01/22, at 2:54, Toshiaki Katayama wrote:

> Dear Pj,
>
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.
>
>> I wonder how many people on this list would contribute code if it was
>> more loosely organised.
>
> Indeed.
>
> However, I think our move from cvs to git was already a great step
> that it opened large opportunity to all those who want to participate
> in development. Before doing that, "open source" project not always
> mean "open to join" project.
>
> Now, everyone can easily fork the project and release their modified
> codes as you already done. So, we may able to evaluate from the  
> current
> situation that how many other people have tried.
>
> Anyway, it is still a difficult problem that who will decide and
> how to decide when to migrate the contributed code into the main tree.
> It might sound like a excuse, but I'm also suffering from the  
> difficulty.
> I also have several modules which are not yet contributed to the  
> main tree.
> For example, my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/ 
> sge/)
> because I'm not sure it is general enough and where it fits.
>
>
> As for the HTML portion, I see your point.
>
> * I'd like to hear comments from others.
> * How people like to render/visualize the BioRuby objects  
> (especially in HTML)?
> * I didn't mean to use the CGI class for HTML generation (I even  
> don't like that).
> * The use of <p /> seems invalid in XHTML. See http://www.w3.org/TR/ 
> xhtml1/#C_3
>
>
> P.S.
> Once, I had developed a mechanism to integrate end-user code snippets
> in the BioRuby shell, called plugins. I wrote some plugins which  
> render
> a colored codon table, a formatted summary of sequence properties etc.
>
> If those and functions defined in your plugins can be easily  
> accessed by
>
>   puts Bio.your_function_name(options)
>
> or something like that, is it satisfy your needs?
>
> If so, we can consider to make a repository for such plugins and  
> bundle
> them in the BioRuby as well.
>
> Regards,
> Toshiaki Katayama
>
>
> On 2010/01/22, at 1:20, Pjotr Prins wrote:
>
>> Dear Toshiaki,
>>
>> On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote:
>>> I looked your code and had a feeling that we should use some
>>> template system.  If HTML tags are hard coded in the library as you
>>> did, it will be very hard to modify them by the user.
>>
>> Aren't we trying to overcomplicate things? This is an HTML generator
>> - in fact it is embedded HTML as I don't provide the <html>,  
>> header or
>> body parts. It can just be inserted into Rails, or whatever HTML
>> framework that is out there.
>>
>> Templating is just another abstraction. I don't intend to template
>> engines like Rails.
>>
>> Or, are you here merely referring to using the CGI class (or  
>> something
>> like that).  I guess I could do that, though I have trouble seeing  
>> the
>> benefits. It is just another way of writing HTML statements.
>>
>>> Besides, what version of the HTML specification did you have in
>>> mind?
>>> This is my first time to see the <p> tag is used in the form of  
>>> <p />. Is it valid?
>>
>> Yes. It is, in fact, XHTML.
>>
>>> I also think decorations should be separated to the CSS layer and  
>>> you should avoid to use the <font> tag, especially when you are  
>>> trying to distribute your code as a part of the library.
>>
>> We use hard coded colors. I could use CSS, but then you need to
>> provide a CSS file (or I need to hard code the header of the file).
>> That makes it (again) more complicated than necessary. Where do we
>> store the CSS file, how do we make sure the browser finds it? CSS is
>> really to adapt look and feel. If the output is meant to be fixed,  
>> why
>> make it flexible?  Besides all (future) browsers support the font  
>> tag,
>> as used. If that stops we could always adapt that source code.
>>
>>> As for the file location, I still like the way Naohisa has
>>> suggested.
>>
>> Alright. I can move the files, if that was all.
>>
>> However, my colored alignment is not going to make it into Bioruby
>> this way. There is always something wrong with my code, it appears.
>> Now I need to move file locations that have not really been decided
>> on; I need to template HTML - but we haven't decided how and it is
>> questionable; I need to use CSS, though I think it makes things worse
>> for users.
>>
>> Are we really sure you want to reject this code just because it does
>> not live up to everyone's current and future expectations? It may
>> still be useful to someone else, you know, it does not break anything
>> else, and can be improved in the future. Once we decide what we want
>> to achieve.
>>
>> The same really holds to my PAML branch and my GEO branch. Both
>> contain useful utilities for others to use. And now the alignment is
>> the third pending Bioruby branch.
>>
>> Can you imagine my growing frustration? Should this go into Bioruby,
>> or should I start another project, like others have done? Or stick it
>> into my existing biotools or bigbio projects? Just, so I don't have
>> the hassle?
>>
>> The way the Perl people handle it is by having independent modules.
>> Everyone owns his, or her, own module and Perl's CPAN acts more as an
>> aggregator. The advantage is that the environment is more dynamic. And
>> you really don't care what is inside a module. That is up to the
>> maintainer and his/her users.
>>
>> We could create independent BioRuby modules, which have their own git
>> repositories. When a module is nice enough to include in Bioruby make
>> it a git submodule - I use this technique for biolib - it will
>> register in the BioRuby repository. That way Bioruby still controls
>> what goes in a release. However, modules can be maintained for
>> experimental setups or private use. So my modules would go in
>>
>>  lib/bio/modules/paml
>>  lib/bio/modules/geo
>>  lib/bio/modules/htmlalignment
>>
>> each its own git repository.
>>
>> When one of those is 'strong' enough for main line you move it into a
>> different location in the main repository. Modules could even be
>> included in Bioruby releases.
>>
>> What hurts me now is that no one is going to use my code, since I
>> don't have the time to make it perfect, and it is hidden in my
>> experimental Bioruby branches. We should find a way to make
>> 'experimental code' available to the rest of the community. That way
>> we may also 'recruit' help to make the code more perfect.
>>
>> Make it easy to allow external modules to become visible through
>> Bioruby - that is a win-win, as well as a more bazaar-like approach
>> to OSS development.
>>
>> I wonder how many people on this list would contribute code if it was
>> more loosely organised.
>>
>> Pj.
>
>


From jan.aerts at gmail.com  Fri Jan 22 04:34:43 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 22 Jan 2010 09:34:43 +0000
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>
References: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
	<EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>
Message-ID: <4c7507a71001220134j3eecf626y90755ddd919336e4@mail.gmail.com>

Hear, hear... Exactly my feelings as well.

j.

2010/1/21 Yannick Wurm <yannick.wurm at unil.ch>

> On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote:
>
> > Are we really sure you want to reject this code just because it does
> > not live up to everyone's current and future expectations? It may
> > still be useful to someone else, you know, it does not break anything
> > else, and can be improved in the future. Once we decide what we want
> > to achieve.
>
> >
> > What hurts me now is that no one is going to use my code, since I
> > don't have the time to make it perfect, and it is hidden in my
> > experimental Bioruby branches. We should find a way to make
> > 'experimental code' available to the rest of the community. That way
> > we may also 'recruit' help to make the code more perfect.
>
>
> I agree 100% that enthusiastic bioruby improvements like Pjotr's should be
> encouraged & given maximal visibility.
> It's better to have great tools with room for improvement than no tools.
> (a year or two ago I needed colored html alignments and ended up with an
> ugly, ugly hack that used t_coffee to generate html output from the
> alignments I'd generated elsewhere - something like Pjotr's code would have
> been much more elegant)
>
> I also have the feeling that code contributions in general are given more
> negative than positive feedback on this list. I believe it's a grave mistake
> because the bioruby community will not grow without passionate users &
> contributors and more quality code.
>
> just my two cents,
>
> yannick
>
> --------------------------------------------
>          yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
>   http://www.unil.ch/dee/page28685_fr.html
>
>

From tomoakin at kenroku.kanazawa-u.ac.jp  Fri Jan 22 04:48:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:48:20 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100122085508.GB12248@thebird.nl>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
	<20100122085508.GB12248@thebird.nl>
Message-ID: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>

Hi,

>   Bio::Module.load(:html_alignment)

What is the benefit over
require 'bio/html_alignment' # no autoload by require 'bio'
?

>   Bio::Module.install(:html_alignment)
>
> will search the definition and install it.


I feel installation is easier from shell like:
$ ruby bioruby-inst-module html_alignment
but calling the Module.install internally is fine.

> (5) Modules can be updated outside Bioruby's mainline. e.g.  
> Bio::Module.install(:html_alignment,:development=>true)

We need a mechanism to check version compatibility between
standard bioruby and the modules, especially when
mainline bioruby is updated.  Different modules will probably
have different levels of dependency on the bioruby code, and
an update to the main bioruby code may sometimes break an old
module.
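
One possible shape for such a check, as a sketch only. It assumes each
module declares the bioruby versions it works with as a RubyGems-style
requirement string; the MODULE_REQUIREMENTS table and the
module_compatible? helper are hypothetical names, and only
Gem::Requirement and Gem::Version are real RubyGems classes.

```ruby
require 'rubygems'

# Hypothetical table: module name => bioruby versions it was tested with.
MODULE_REQUIREMENTS = {
  :html_alignment => '~> 1.4'   # i.e. >= 1.4 and < 2.0
}

# True if the given bioruby version string satisfies the module's
# declared requirement.
def module_compatible?(name, bioruby_version)
  requirement = Gem::Requirement.new(MODULE_REQUIREMENTS.fetch(name))
  requirement.satisfied_by?(Gem::Version.new(bioruby_version))
end

puts module_compatible?(:html_alignment, '1.4.0')
puts module_compatible?(:html_alignment, '2.0.0')
```

The loader could warn, or refuse to load, when the check fails.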

-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Fri Jan 22 05:49:00 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 11:49:00 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
	<20100122085508.GB12248@thebird.nl>
	<1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100122104900.GB15628@thebird.nl>

On Fri, Jan 22, 2010 at 06:48:20PM +0900, Tomoaki NISHIYAMA wrote:
>>   Bio::Module.load(:html_alignment)
>
> What is the benefit over
> require 'bio/html_alignment' # no autoload by require 'bio'
> ?

A method allows more checking. I presume the module information will
be somewhere in a YAML file in the main tree. Or maintained through
git submodules.

>>   Bio::Module.install(:html_alignment)
>>
>> will search the definition and install it.
>
> I feel installation is easier from shell like:
> $ ruby bioruby-inst-module html_alignment
> but calling the Module.install internally is fine.

My example is for an interactive session. You only do it once (I
hope). Or when an author says he has updated his module.

>> (5) Modules can be updated outside Bioruby's mainline. e.g.  
>> Bio::Module.install(:html_alignment,:development=>true)
>
> We need to have a mechanism to check the versions between
> the standard bioruby and the modules. Especially when the
> mainline bioruby is updated.  Different modules perhaps will
> have different level of dependency on the bioruby code, and
> update in the main bioruby code sometimes may break the old
> module.

Well.

Bioruby should not care.

I think you misunderstand the purpose. Modules are *not* to be
supported from Bioruby. It is only a mechanism to make them easily
available. If things break, they break. That is why it is
developmental, or experimental.

The modules that are well 'supported' will come inside the
distribution.  Outside modules are up to the module maintainer.

Besides, you don't want to replace gems. If an author wants versioning
he can provide a gem (which, again, can be loaded as a Bioruby
module).

Once a module goes main stream versioning is moot. It just becomes
part of the Bioruby tree.

When everyone understands this a module can still support versioning.
But I think that ought to be done through gems.

Pj.

From andrew.j.grimm at gmail.com  Tue Jan 26 07:12:35 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Tue, 26 Jan 2010 23:12:35 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>

Hi Naohisa Goto,

I tried creating a new factory in each thread, but I sometimes (but
not always) have errors.

Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
correct? Does it cause problems for anyone else?

Some of the errors I get include the ones seen at http://gist.github.com/286775

It's possible that the issues are caused by problems in tempfile
itself (which may have been fixed in August 2009 according to the
changelog).

Thanks,

Andrew

On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi,
>
> On Wed, 20 Jan 2010 23:09:19 +1100
> Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
>
>> Is alignment intended to be thread-safe in bioruby? If so, should I
>> use the same alignment factory between threads, or a separate one in
>> each thread?
>
> It is not confirmed to be thread-safe, so it is safe to use
> separate one in each thread.
>
> Currently, in BioRuby, manipulating the same object from different
> threads is not intended. When manipulating the same object from
> different threads is needed, using mutex is recommended.
>
> For library developers, it is encouraged to write thread-safe
> code if possible, but not mandatory.
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Andrew

From ngoto at gen-info.osaka-u.ac.jp  Tue Jan 26 10:00:04 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 27 Jan 2010 00:00:04 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
	<b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
Message-ID: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>

Hi Andrew,

On Tue, 26 Jan 2010 23:12:35 +1100
Andrew Grimm <andrew.j.grimm at gmail.com> wrote:

> Hi Naohisa Goto,
> 
> I tried creating a new factory in each thread, but I sometimes (but
> not always) have errors.

Please show ruby version and BioRuby version.
 % ruby -v
 % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
(If you are using BioRuby 1.2.1 or earlier, 
 % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
)

> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
> correct? Does it cause problems for anyone else?

The "rescue RuntimeError" in line 15 may hide problems.
In my environment, it seems that the RuntimeError is raised
in lib/bio/alignment.rb. The error message I observed
without the rescue was
"alignment result is inconsistent with input data",
and the output file created by ClustalW was unexpectedly empty.
It might be a bug in Ruby's Tempfile, but I am not sure.
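
Instead of rescuing and discarding the error, each thread could record
it so that it can be inspected afterwards. A minimal sketch, where the
jobs are plain lambdas standing in for the real alignment calls and
run_all is a hypothetical helper name:

```ruby
require 'thread'

# Run every job in its own thread and return [index, exception]
# pairs for the jobs that failed, instead of hiding the failures.
def run_all(jobs)
  errors = Queue.new
  threads = jobs.each_with_index.map do |job, i|
    Thread.new do
      begin
        job.call
      rescue StandardError => e
        errors << [i, e]   # keep the error for later reporting
      end
    end
  end
  threads.each { |t| t.join }

  collected = []
  collected << errors.pop until errors.empty?
  collected
end

failed = run_all([
  lambda { :ok },
  lambda { raise "alignment result is inconsistent with input data" }
])
failed.each { |i, e| puts "job #{i}: #{e.class}: #{e.message}" }
```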

With Ruby 1.8.7, errors are sometimes observed.
  % ruby -v
  ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
  ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
  ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

With Ruby 1.9.1-p378, no errors occurred when I ran it several times.
  % ruby -v
  ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]

> Some of the errors I get include the ones seen at http://gist.github.com/286775

The message "ERROR: Multiple sequences found with same name
(found 0 at least twice)!" is reported by ClustalW, and
it indicates incorrect input file sequence names. Maybe
two file contents are unexpectedly concatenated or mixed
possibly due to a bug of Tempfile, but not sure.

> It's possible that the issues are caused by problems in tempfile
> itself (which may have been fixed in August 2009 according to the
> changelog).

Another possibility is the resource limits of the machine:
the number of child processes, total memory size, etc.
If the limits are exceeded, a new child clustalw process may
fail to start, or running clustalw processes may be
killed. This would also cause empty or truncated result files
and lead to Ruby-level errors.
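
To stay under such limits, the number of simultaneously running
threads (and therefore child processes) can be bounded with a small
worker pool. This is only a sketch: each_in_pool is a hypothetical
helper, and the block stands in for the real alignment call.

```ruby
require 'thread'

# Process items with at most pool_size threads running at once.
# Results are returned in completion order, not input order.
def each_in_pool(items, pool_size, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  results = Queue.new

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        begin
          item = queue.pop(true)   # non-blocking pop
        rescue ThreadError         # raised when the queue is empty
          break
        end
        results << block.call(item)
      end
    end
  end
  workers.each { |w| w.join }

  Array.new(results.size) { results.pop }
end

doubled = each_in_pool((1..20).to_a, 5) { |n| n * 2 }
puts doubled.sort.inspect
```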

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



From andrew.j.grimm at gmail.com  Tue Jan 26 22:07:18 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 27 Jan 2010 14:07:18 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
	<b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
	<20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <b9140daa1001261907v674bd586ha96950baf87ff66e@mail.gmail.com>

Hi Naohisa Goto,

On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi Andrew,
>
> On Tue, 26 Jan 2010 23:12:35 +1100
> Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
>
>> Hi Naohisa Goto,
>>
>> I tried creating a new factory in each thread, but I sometimes (but
>> not always) have errors.
>
> Please show ruby version and BioRuby version.
>  % ruby -v
>  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
> (If you are using BioRuby 1.2.1 or earlier,
>  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
> )
>

I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0.

>> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
>> correct? Does it cause problems for anyone else?
>
> The "rescue RuntimeError" in line 15 may hide problems.
> In my environment, it seems that the RuntimeError is raised
> in lib/bio/alignment.rb. The error message I observed
> without the rescue was
> "alignment result is inconsistent with input data",
> and output file created by Clustalw was unexpectedly empty.
> It might be a bug of Tempfile in Ruby, but not sure.
>
> With Ruby 1.8.7, errors are observed in some times.
>   % ruby -v
>   ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
>   ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
>   ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
>
> With Ruby 1.9.1-p378, no errors when I executed several times.
>   % ruby -v
>   ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
>

I suspect errors may occur on earlier versions of ruby 1.9.1.

>> Some of the errors I get include the ones seen at http://gist.github.com/286775
>
> The message "ERROR: Multiple sequences found with same name
> (found 0 at least twice)!" is reported by ClustalW, and
> it indicates incorrect input file sequence names. Maybe
> two file contents are unexpectedly concatenated or mixed
> possibly due to a bug of Tempfile, but not sure.
>
>> It's possible that the issues are caused by problems in tempfile
>> itself (which may have been fixed in August 2009 according to the
>> changelog).
>
> Another possibility is resource limits of the machine:
> the number of child processes, total memory size, etc.
> If exceeding limits, new child clustalw process could
> not be started, or running clustalw processes might be
> killed. This also causes void or truncated result files,
> and leads to ruby-level errors.
>

Thanks for that suggestion. I re-ran the test using only 5 threads in
the new gist http://gist.github.com/287499



From missy at be.to  Fri Jan 29 01:46:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 15:46:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
Message-ID: <4B628437.30305@be.to>

Hi all,

How about implementing the following methods?

	Bio::FastaFormat#each_entry
	Bio::FastaNumericFormat#each_entry

The following is a sample code to generate a FASTQ string from a FASTA 
string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.

I am afraid that simpler or easier ways may already exist in BioRuby...

Hiro.

-----
#!/usr/local/bin/ruby
require 'rubygems'
require 'bio'

module Bio
   class FastaFormat
     def each_entry
       return to_enum(:each_entry) unless block_given?
       @continue = self.dup
       loop do
         yield @continue
         overrun = @continue.entry_overrun
         break unless overrun
         @continue = Bio::FastaFormat.new(overrun)
       end
     end
   end

   class FastaNumericFormat
     def each_entry
       return to_enum(:each_entry) unless block_given?
       @continue = self.dup
       loop do
         yield @continue
         overrun = @continue.entry_overrun
         break unless overrun
         @continue = Bio::FastaNumericFormat.new(overrun)
       end
     end
   end
end

fasta = <<EOS
>FXQB1I00000001
TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
>FXQB1I00000002
CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
EOS

qual = <<EOS
>FXQB1I00000001
30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30 
42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
>FXQB1I00000002
30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34 
22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30 
27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
EOS

enum_fasta = Bio::FastaFormat.new(fasta).each_entry
enum_qual = Bio::FastaNumericFormat.new(qual).each_entry

loop do
   fastq = Bio::Sequence.adapter(enum_fasta.next,
                                 Bio::Sequence::Adapter::Fastq)
   fastq.quality_score_type = :phred
   fastq.quality_scores = enum_qual.next.data
   puts fastq.output(:fastq)
end

-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From ngoto at gen-info.osaka-u.ac.jp  Fri Jan 29 05:25:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 29 Jan 2010 19:25:29 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <4B628437.30305@be.to>
References: <4B628437.30305@be.to>
Message-ID: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" <missy at be.to> wrote:

> Hi all,
> 
> How about implementing the following methods?
> 
> 	Bio::FastaFormat#each_entry
> 	Bio::FastaNumericFormat#each_entry
> 
> The following is a sample code to generate a FASTQ string from a FASTA 
> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways may already exist in BioRuby...

I think mixing a single-entry parser with a multiple-entry iterator
would cause confusion, and is not a good approach.

For most parser classes in bioruby, the expected data source is a
String containing single-entry data. In addition, for an IO with
possibly multiple entries, Bio::FlatFile is the front-end that
detects the data type, splits the input into entries, and calls the
assigned parser class.

For a String containing multiple entries, using StringIO and
then Bio::FlatFile is the easiest way, although indirect.
Recently, many efficient memory-mapped data transfer methods
have become available, e.g. memcached, IPC shared memory, and the
mmap(2) system call. I'm now thinking about how to treat such data
efficiently.

Below is an example using StringIO and Bio::FlatFile.
#------------------------------------------------
  require 'stringio'
  require 'bio'

  # When copying and pasting this script, remove the "> " at the
  # head of each quoted line.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
> 
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30 
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34 
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30 
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS
  
  ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
  ff_qual = Bio::FlatFile.open(StringIO.new(qual))

  while entry_fasta = ff_fasta.next_entry
    seq = entry_fasta.to_biosequence
    seq.quality_score_type = :phred
    seq.quality_scores = ff_qual.next_entry.data
    puts seq.output(:fastq, :title => entry_fasta.definition)
  end
#------------------------------------------------

> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
> 
> loop do
>    fastq = Bio::Sequence.adapter(enum_fasta.next,
>                                  Bio::Sequence::Adapter::Fastq)
>    fastq.quality_score_type = :phred
>    fastq.quality_scores = enum_qual.next.data
>    puts fastq.output(:fastq)
> end

Bio::Sequence.adapter is for internal use within the bioruby library
only, and normally should not be used by user scripts. In addition,
using Adapter::Fastq for Bio::FastaFormat data is a mismatch.
In this case, use Bio::FastaFormat#to_biosequence.

> 
> -- 
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From missy at be.to  Fri Jan 29 06:24:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 20:24:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
References: <4B628437.30305@be.to>
	<20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <4B62C55F.1050506@be.to>

Hi, Naohisa GOTO,

Thank you so much for the detailed explanation and the sample code. It was
a big help for me in understanding BioRuby's overall design.

Although I used here-documents in my code, what I wanted to do was just
make a FASTQ file from regular FASTA and FASTA.QUAL files.

I tried your code using my relatively large input files. It was much
faster than my code.

The final code is simply the following:
----
require 'bio'

ff_fasta = Bio::FlatFile.open(ARGV[0])
ff_qual = Bio::FlatFile.open(ARGV[0]+".qual")

while entry_fasta = ff_fasta.next_entry
   seq = entry_fasta.to_biosequence
   seq.quality_score_type = :phred
   seq.quality_scores = ff_qual.next_entry.data
   puts seq.output(:fastq, :title => entry_fasta.definition)
end
----

Hiro.

Naohisa GOTO wrote (2010/01/29 19:25):
> Hi,
>
> On Fri, 29 Jan 2010 15:46:15 +0900
> "MISHIMA, Hiroyuki"<missy at be.to>  wrote:
>
>> Hi all,
>>
>> How about implementing the following methods?
>>
>> 	Bio::FastaFormat#each_entry
>> 	Bio::FastaNumericFormat#each_entry
>>
>> The following is a sample code to generate a FASTQ string from a FASTA
>> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>>
>> I am afraid that simpler or easier ways already exist in BioRuby...
>
> I think mixing single entry parser with multiple entry iterator
> will cause confusion, and not good way.
>
> For most parser classes in bioruby, expected data source is
> String containing single entry data. In addition, for IO with
> possible multiple entries, Bio::FlatFile is the front-end that
> can detect data type, splits each entry, and calling assigned
> parser class.
>
> For String containing multiple entries, using StringIO and
> then Bio::FlatFile is the easiest way, although indirect.
> Recently, many efficient memory-mapped data transfer methods
> are available, e.g. memcached, IPC shared memory, mmap(2)
> system call. I'm now thinking how to treat such data efficiently.

-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences

From biopython at maubp.freeserve.co.uk  Fri Jan 29 05:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [BioRuby] [Bioperl-l] [MOBY-dev] OpenBio solution challenge:
	Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic, but should we continue it on just the one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results.  This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on...  Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious for example? There are already several major projects
to do this e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)


From ngoto at gen-info.osaka-u.ac.jp  Mon Jan  4 07:15:18 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 4 Jan 2010 16:15:18 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20091231141546.GA5770@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
Message-ID: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I also think the current Bio::PAML::Codeml::Report needs to be
rewritten. It would be great if you did so. Here are my comments.

>  codeml = Bio::PAML::Codeml.new(nil, :runmode => 0, :RateAncestor => 1,
>                                      :alpha => 0.5, :fix_alpha => 0)
>  report = codeml.query(alignment, tree)
>
> which, as it happens, works. The 'nil' points to the program executable.
> 'nil' merely fills in 'codeml'. It would have been better to make it one
> of the listed options, e.g. :binary => 'codeml'. That would save the ugly
> 'nil' parameter and belongs more to the principle of least surprise, that
> makes Ruby shine.

It is safer not to merge bioruby internal options with PAML's options.
If the upstream authors of PAML introduced a new option named binary,
a severe problem would occur.

One way is to write code that acts something like C++ polymorphism.
For example, the code below accepts the following three cases.
* Bio::PAML::Codeml.new("/path/to/codeml")
* Bio::PAML::Codeml.new({ :xxx => yyy, :ppp => qqq })
* Bio::PAML::Codeml.new("/path/to/codeml", { :xxx => yyy, :ppp => qqq })

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0, 1
      begin
        params = argv[0].to_hash
      rescue NoMethodError
        program = argv[0]
      end
    when 2
      program, params = *argv
    else
      raise ArgumentError, "wrong number of arguments (#{argv.size} for 2)"
    end
    # continues to the current code...

The drawbacks are:
* The complexity of the code increases.
* It might make the code difficult to refactor, especially when keyword
   arguments are introduced in a future version of Ruby.
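
As a self-contained illustration of the three call forms listed above,
here is a minimal sketch. The class name CodemlArgs and the constant
DEFAULT_PROGRAM are hypothetical stand-ins, not part of BioRuby, and the
sketch checks respond_to?(:to_hash) instead of rescuing NoMethodError,
which is a swapped-in variant of the same idea.

```ruby
# Hypothetical stand-in for Bio::PAML::Codeml; only the argument
# handling is shown.
class CodemlArgs
  DEFAULT_PROGRAM = 'codeml'  # assumed default used when no path is given

  attr_reader :program, :params

  def initialize(*argv)
    program = nil
    params = {}
    case argv.size
    when 0
      # nothing given: keep the defaults
    when 1
      # A Hash-like single argument is taken as parameters,
      # anything else as the program path.
      if argv[0].respond_to?(:to_hash)
        params = argv[0].to_hash
      else
        program = argv[0]
      end
    when 2
      program, params = *argv
    else
      raise ArgumentError, "wrong number of arguments (#{argv.size} for 0..2)"
    end
    @program = program || DEFAULT_PROGRAM
    @params = params
  end
end

a = CodemlArgs.new("/path/to/codeml")
b = CodemlArgs.new({ :runmode => 0 })
c = CodemlArgs.new(nil, { :alpha => 0.5 })
```

Here a.program is the given path with empty params, while b and c both
fall back to the assumed default program name.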

Note that Ruby's author Matz has said that he had not applied the
principle of least surprise to the design of Ruby.
(http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
Please be aware that the phrase "principle of least surprise (POLS)"
is a taboo ("NG") word when you request something in Ruby.
(http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

>  A new implementation of Bio::PAML::Codeml::Report

> So I propose to rewrite the class supporting for multiple models,
> with the following usage (starting from a codeml report - really result):
>
> >> report.models.size
> => 2
> >> report.models[0].name
> => "M0"

I suppose report.models returns a Hash containing objects of newly written
class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
It seems good.

Existing methods could be changed to return the first model's values.
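
A minimal sketch of that idea, assuming a report that holds several
models and keeps old-style accessors working by answering with the first
model. SketchModel, SketchReport, and the numbers are hypothetical
placeholders, not BioRuby classes or real codeml output.

```ruby
# Hypothetical stand-ins; names and fields are assumptions for
# illustration, not the classes in BioRuby or in the PAML branch.
SketchModel = Struct.new(:name, :lnL, :omega)

class SketchReport
  attr_reader :models

  def initialize(models)
    @models = models
  end

  # Backward-compatible accessors: answer with the first model's values.
  def lnL
    models.first.lnL
  end

  def omega
    models.first.omega
  end
end

# Placeholder numbers, not real codeml output.
report = SketchReport.new([SketchModel.new("M0", -1.0, 0.1),
                           SketchModel.new("M3", -2.0, 0.2)])
report.models.size     # => 2
report.models[0].name  # => "M0"
report.lnL             # => -1.0
```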

> Unit tests

Currently, tests with external dependencies (e.g. web services) are
located in the test/functional/ directory. So, your tests running
codeml would be named test/functional/bio/appl/paml/test_codeml.rb,
test/functional/bio/appl/paml/codeml/test_report.rb, or something
like this.

> These tests, for example, can be run on a special switch:
>
>  runner.rb --test-dependencies

I'm now looking for ways to pass such parameters to tests.
Note that tests can also be run in various ways. For example,
  ruby test/unit/bio/appl/paml/codeml/test_report.rb 
  testrb test/unit/bio/appl/paml/codeml
  rake test

> I am sure it works, but doesn't anyone think this belongs in a support
> module (e.g. BioTestFile) for testing? What I would like to see is
> something less brittle:
>
>  require 'bio/test'
>  str = BioTestFile::read('paml/codeml/output.txt')

I'd like to keep tests simple and clear, and I think using the standard
File.read is sufficient and clearer. With such a special class, one has to
read an extra file to understand how the test code behaves.

> Personally, I dislike the naming/name space scheme of Bioruby.
> What to think of invoking a class named
>
>  report = Bio::PAML::Codeml::Report.new

Because there are many bioinformatics software packages and databases,
names tend to be longer and namespace nesting tends to be deeper.
I'd like to know the naming rules and policies of other open-bio projects.

> Why can't it just be
>
>  include Bio
>  report = Codeml.new

I think it is enough to write "include Bio::PAML" instead of (or in
addition to) "include Bio".

>  include Bio
>  result = Paml.new(:program => 'codeml')

I don't like introducing a new parameter such as :program.
I think one class per binary is better.

In addition, because the differences within PAML tools (codeml, baseml,
yn00, etc.) are currently not small, merging the classes is not so
realistic now.

On Thu, 31 Dec 2009 15:15:46 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> Hi Michael,
> 
> I have a writeup on improving the current PAML functionality. Are you
> OK with this?
> 
>   http://bioruby.open-bio.org/wiki/BIORUBY_PAML
> 
> (maybe it does not belong on the bioruby Wiki - but I think of it
> like a 'design' document).
> 
> Pj.
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From pjotr.public14 at thebird.nl  Mon Jan  4 09:03:18 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 10:03:18 +0100
Subject: [BioRuby] Bioruby design
Message-ID: <20100104090318.GA16136@thebird.nl>

Thanks for the reply Naohisa. As we are moving on to design, rather
than one implementation I am changing the thread.

On Mon, Jan 04, 2010 at 04:15:18PM +0900, Naohisa GOTO wrote:
> It is safe not to merge bioruby internal options and PAML's options.
> If the upstream authors of PAML introduced a new option named binary,
> severe problem would occur.

I am against breaking interfaces. This is a minor design problem
which should be avoided in the future. And, yes, I would certainly
not favour a polymorphism solution, unless unavoidable. 

I don't think it is worth 'fixing' this interface aspect at this stage. 

Perhaps, there will be opportunities later.

> Note that Ruby's author Matz has said that he had not applied the
> principle of least surprise to the design of Ruby.
> (http://en.wikipedia.org/wiki/Ruby_(programming_language)#Philosophy )
> Please be careful that the word "principle of least surprise (POLS)"
> is NG word when you request something in Ruby.
> (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/26942 )

I did not know that, and personally I do not care. I think POLS is a
really good idea, though it should not automatically come at the
expense of (for example) convenience, or performance. I favour easy
API's, and that is where the principle of least surprise comes in. It
means to me that I don't have to fetch the manuals every time (like I
do with Perl). So, let's not throw away the baby with the bath water.

I like POLS, as much as I like KISS.

> > >> report.models[0].name
> > => "M0"
> 
> I suppose report.models returns a Hash containing objects of newly written
> class (for example, Bio::PAML::Codeml::Report::Model) or Struct.
> It seems good.

In fact, I have made it an array. See my PAML branch.

> >  runner.rb --test-dependencies
> 
> I'm now searching ways to pass such parameters to tests.

In the runner you can parse the parameters first and pull them off
the stack. I did something like that for cfruby:

  http://cfruby.rubyforge.org/git?p=cfruby.git;a=blob;f=test/runner.rb;h=c202e48783a744c4cb3e339e2b891b3eab354c3e;hb=HEAD
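
  A rough sketch of that approach, not the cfruby code itself: the
  runner removes its own switch from the argument list before the test
  framework gets to parse what is left. The helper name extract_switch!
  is made up for this example.

```ruby
# Remove a custom switch from the argument list and report whether it
# was present. In a real runner this would be called on ARGV before the
# test framework parses the remaining arguments.
def extract_switch!(argv, switch)
  found = argv.include?(switch)
  argv.delete(switch)
  found
end

args = ['--test-dependencies', '-v']
run_dependency_tests = extract_switch!(args, '--test-dependencies')
# args now holds only the arguments meant for the test framework
```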

 
> I'd like to keep tests simple and clear, and I think using standard
> File.read is enough and clearer. When using such special class, to know
> the behavior of the test code, reading extra file is needed.

I disagree, but that is obvious. 

> > Personally, I dislike the naming/name space scheme of Bioruby.
> > What to think of invoking a class named
> >
> >  report = Bio::PAML::Codeml::Report.new
> 
> Because there are many bioinformatics software and databases, names
> tends to be longer, and nesting of namespace tends to be deeper.
> I'd like to know naming rules and policies of other open-bio projects.

I think we should not mirror ourselves on these. We can do better.
RoR is a much better example to mirror ourselves on.

> > Why can't it just be
> >
> >  include Bio
> >  report = Codeml.new
> 
> I think it is enough to write "include Bio::PAML" instead of (or in
> addition to) "include Bio".

Not really. It brings in another source of errors for users if they
have to think about that context every time. We will get all
variants, like Bio::Kegg, Bio::Sequence etc.

I think name spaces are there to *avoid* conflict. If a naming scheme
precludes conflict, why bring in another layer?

I want Bioruby to be as easy as possible, and with the least
amount of typing. More text = harder to read.

> >  include Bio
> >  result = Paml.new(:program => 'codeml')
> 
> I don't like introducing such new parameter like :program.
> I think 1 class 1 binary is better.

I agree. It was just another option.

> In addition, because the differences within PAML tools (codeml, baseml,
> yn00, etc.) are currently not small, merging the classes is not so
> realistic now.

We have to separate our own conveniences from design choices.

Meanwhile I do agree we should not change the current interfaces. We
can create a new version of Bioruby with both old and new interfaces
supported. That is one thing I propose.

I am putting together a discussion document on the future of Bioruby
(design choices). We will have opportunity to discuss that in Japan.
We can consider raising a community vote once we have a list of
options.

Pj.


From pjotr.public14 at thebird.nl  Mon Jan  4 11:51:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 12:51:05 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100104115105.GA21035@thebird.nl>

I have updated the writeup at

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

have a look at my PAML branch. The (old) unit tests pass.

  http://github.com/pjotrp/bioruby/tree/PAML

I have to add the positive selection sites, to complete it.

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Mon Jan  4 12:33:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 4 Jan 2010 21:33:20 +0900
Subject: [BioRuby] Bioruby design
In-Reply-To: <20100104090318.GA16136@thebird.nl>
References: <20100104090318.GA16136@thebird.nl>
Message-ID: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>

Hi,

 > As people tend not to think of Paml as a toolbox I would prefer
 > to have one object named Paml. With behind it the codeml 'engine'
 > and reporter. This would work for me (also note Paml does
 > not return a report, but rather a result):

I don't agree on this point.  PHYLIP is clearly a package or
collection of programs, and so are Molphy, PAML, ...

 > result = Paml.new(:program => 'codeml')
And if you make a single object, it is not too obvious how to divide it
based on the program, since aaml is now done by codeml but should
clearly be considered a different function.

>>>  include Bio
>>>  report = Codeml.new
>>>
>>
>> I think it is enough to write "include Bio::PAML" instead of (or in
>> addition to) "include Bio".
>>
>
> Not really. It brings in another source of errors for users if they
> have to think about that context every time. We will get all
> variants, like Bio::Kegg, Bio::Sequence etc.


These are short enough, since we have to write something like
"PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
in the manuscript of the paper if we use that module.
Stating their use explicitly in the first lines of the
program is considered good.

On the other hand, I don't like "include Bio::Sequence", since it is
a function of bioruby in itself.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Mon Jan  4 15:04:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 16:04:59 +0100
Subject: [BioRuby] Bioruby design
In-Reply-To: <14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
References: <20100104090318.GA16136@thebird.nl>
	<14BB7DE3-1D1B-4180-91EE-156EAADF4A98@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100104150459.GB21412@thebird.nl>

On Mon, Jan 04, 2010 at 09:33:20PM +0900, Tomoaki NISHIYAMA wrote:
> These are short enough, since we have to write something like
> "PAML ver XXX (Yang, XX) was used for XX" and "KEGG (Kanehisa, XXX)"...
> in the manuscript of the paper if we use that module.
> Stating their use explicitly in the first lines of the
> program is considered good.

Uhm. I think that is a bit far fetched. The way you propose it is
that you would have to load the name space every time you use
something in code:

  require 'bio'

  include Bio::PAML
  include Bio::Kegg
  include ...
  
  do something

next source file, the same. And again:

  require 'bio'

  include Bio::PAML
  include Bio::Kegg
  include ...
  
  do something

This is the philosophy of Python - where every source file explicitly
loads all modules/name spaces.

It is arguably 'clear'. But ugly. And, takes the fun out of
programming (anyone mention that?).

Only once I have used the Python name spacing with good effect. It was
when we plugged in a replacement module - completely rewritten. That
was changing one line only - and it worked :-). In Python you can say

  import Paml as paml

it became

  import Paml2 as paml

That was nice. But when you see Python source files, the header is
ugly, and wastes a lot of typing. See for example:

  http://pypi.python.org/pypi/zope.sqlalchemy#example

I argue against having to state imports; importing Bio should be part of

  require 'bio'

Anyway, we will have time to talk in Tokyo, I hope. 

Pj.


P.S. Do you have an example of anyone quoting a Bioruby module in a
paper?


From pjotr.public14 at thebird.nl  Mon Jan  4 17:09:04 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jan 2010 18:09:04 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104115105.GA21035@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
Message-ID: <20100104170904.GA26187@thebird.nl>

The writeup is pretty much done, as well as the implementation.

  http://bioruby.open-bio.org/wiki/BIORUBY_PAML

All unit tests pass:

  Running tests for PAML
  Loaded suite .
  Started
  ....................
  Finished in 0.398394 seconds.
  20 tests, 37 assertions, 0 failures, 0 errors

It is compatible with the old version. I have added 41 assertions
in the doctest (the header of report.rb).

  === Testing 'mydoc.test'...
  1.   OK  | Default Test
  41 comparisons, 1 doctests, 0 failures, 0 errors

You can view the tests and implementation at

  http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb
See also 

The branch is:

  http://github.com/pjotrp/bioruby/tree/PAML

(don't you love github).

Pj.


From mail at michaelbarton.me.uk  Mon Jan  4 17:50:50 2010
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Mon, 4 Jan 2010 12:50:50 -0500
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> 
	<20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
Message-ID: <c27b73c1001040950h390b0d0ej201b59e1a071fe35@mail.gmail.com>

Hi Pjotr,

The expanded report.rb looks like an excellent and substantial
improvement to the previous version. You could add a deprecated tag
to the old interface methods and these could then be removed in a
later bioruby version to decrease clutter in the API.

Mike

2010/1/4 Pjotr Prins <pjotr.public14 at thebird.nl>:
> The writeup is pretty much done, as well as the implementation.
>
>  http://bioruby.open-bio.org/wiki/BIORUBY_PAML
>
> All unit tests pass:
>
>  Running tests for PAML
>  Loaded suite .
>  Started
>  ....................
>  Finished in 0.398394 seconds.
>  20 tests, 37 assertions, 0 failures, 0 errors
>
> It is compatible with the old version. I have added 41 assertions
> in the doctest (the header of report.rb).
>
>  === Testing 'mydoc.test'...
>  1.   OK  | Default Test
>  41 comparisons, 1 doctests, 0 failures, 0 errors
>
> You can view the tests and implementation at
>
>  http://github.com/pjotrp/bioruby/blob/PAML/lib/bio/appl/paml/codeml/report.rb
> See also
>
> The branch is:
>
>  http://github.com/pjotrp/bioruby/tree/PAML
>
> (don't you love github).
>
> Pj.
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 07:42:49 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 16:42:49 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100104170904.GA26187@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
Message-ID: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

I'm reading the code (commit c2de9dd3ad055bab4bfb1d3e8da840493b110b0e).
It is generally good. Below are my comments and suggested changes.

>    # == Examples
>    #
>    # Read the codeml M0-M3 data file into a buffer
>    #
>    # >> require 'bio/test/biotestfile'
>    # >> buf = BioTestFile.read('paml/codeml/models/results0-3.txt')

It is not suitable to use such nonstandard class in the example.
Users want to know the example usage and do not intend to test.
Note that I still disagree with the BioTestFile class.

>    class Report < Bio::PAML::Common::Report
> 
>      attr_reader :models, :header, :footer

RDoc documentation is also needed for attributes. To write RDoc,
the three attribute definitions need to be separated.
For example,

      # Models in the result
      # (Array containing Bio::PAML::Codeml::Model objects)
      attr_reader :models

      # ...(should be written)
      attr_reader :header

      # ...(should be written)
      attr_reader :footer

>      # Parse codeml output file passed with +buf+
>      def initialize buf

Details of +buf+ (class, contents, etc) should also be written in RDoc.
It is recommended to use the style written in the README_DEV.rdoc, or
the style used in the Ruby source code.

Please do not omit parentheses in the method definition lines.

>    # Model class
>    class Model 

Too little documentation. Please at least state that it is
created by Bio::PAML::Codeml::Report.

>      def initialize buf

Please state in the RDoc that normal users do not call this method
directly, and that it is called internally by Bio::PAML::Codeml::Report objects.

Please do not omit parentheses in the method definition lines.

>      def lnL

RDoc documentation is needed here, and also for the omega, kappa, alpha,
tree_length, tree, and to_s methods.

>    class PositiveSite

Almost all methods have no RDoc documentation.

>      def to_a
>        [ @position, @aaref, @probability, @omega ]
>      end

What is the purpose of the method?

>    class PositiveSites < Array

Inheriting from Array to create an original container class is discouraged.
In BioRuby, we have deprecated Bio::Features and Bio::References in
version 1.3.0, although they do not inherit Array but have an array
in the object. (The classes still exist only for backward compatibility,
in lib/bio/compat/features.rb and references.rb).

In this case, apart from initialize, only a method named "graph" is added.
I think it would be better to add the graph method to the Report class and
use a plain Array for storing PositiveSite objects.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From pjotr.public14 at thebird.nl  Tue Jan  5 10:32:12 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 11:32:12 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105103212.GA4584@thebird.nl>

Hi Naohisa,

First I thought you were kidding. But then I realise you are serious.

I don't think we need to document every simple class variable/accessor
to accept this source code. That is overkill. If you don't understand
lnL or alpha, don't use it. We are not in the business of documenting
for documenting's sake.  Documenting lnL and alpha will be like:

"Retrieve the lnL value from the Report" 

"Retrieve the alpha value from the Report" 

etc. etc. I don't think we should be doing that. Standard 1-to-1
relations are obvious and don't need lots of text in the code base.

If someone feels like filling in these obvious statements, fine. It
really goes against my grain. Do we document every single accessor?
Note the previous implementation did no such thing. That code was
accepted fine (and partially written by you).

> Details of +buf+ (class, contents, etc) should also be written in RDoc.
> It is recommended to use the style written in the README_DEV.rdoc, or
> the style used in the Ruby source code.

You mean the contents of the input buffer, which is the content of the
input file? I see many places in Bioruby where no such a thing is
done.  Why become strict on this now? If you want a different
descriptive name for the variable - that is fine. Propose me
a better name.

> >      def to_a
> >        [ @position, @aaref, @probability, @omega ]
> >      end
> What is the purpose of the method?

Access converter. Convenience, really. You can remove it if you
dislike it so much. I use it for testing and to write to a file. Could
be to_s too, but that fixates the format.

> >    class PositiveSites < Array
> 
> To inherit Array and to create original container class is discouraged.
> In BioRuby, we have deprecated Bio::Features and Bio::References in
> version 1.3.0, although they do not inherit Array but have an array
> in the object. (The classes still exist only for backward compatibility,
> in lib/bio/compat/features.rb and references.rb).

A PositiveSites object has all the features of a list (i.e. Array). I
think inheritance is what it should be. It is an is_a relationship.
Adding a @list will just add code. Not only for initialization, but
also for iterators. I only see how we can move backwards from readable
code. Nor is it good OOP practice. Inheritance is not *always* bad,
though I agree it is used too quickly (in general).
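
To make the trade-off concrete, here is a rough side-by-side sketch of
the two designs under discussion. Site, SitesByInheritance,
SitesByDelegation, and this toy graph method are hypothetical names made
up for the example; the real PositiveSites#graph in the PAML branch may
look quite different.

```ruby
require 'forwardable'

# Hypothetical stand-in for a positive-selection site.
Site = Struct.new(:position, :probability)

# Variant 1: inherit from Array, so every Array method comes for free.
class SitesByInheritance < Array
  # Toy graph: '*' for sites above an arbitrary 0.95 cutoff, '.' otherwise.
  def graph
    map { |s| s.probability > 0.95 ? '*' : '.' }.join
  end
end

# Variant 2: wrap an Array; iteration and a few methods are delegated.
class SitesByDelegation
  include Enumerable
  extend Forwardable
  def_delegators :@list, :each, :size, :[]

  def initialize(list)
    @list = list
  end

  def graph
    map { |s| s.probability > 0.95 ? '*' : '.' }.join
  end
end

sites = [Site.new(1, 0.99), Site.new(2, 0.5)]
by_inheritance = SitesByInheritance.new(sites)
by_delegation = SitesByDelegation.new(sites)
```

Both variants give the same graph string for the same input. The wrapper
needs the extra initialize and delegation lines, while the subclass
exposes the whole Array interface, and methods such as map and select
hand back plain Array objects in either case.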

> In this case, except initialize, only a method named "graph" is added.
> I think it is good to add the graph method in the Report class and
> using an Array for storing PositiveSite objects.

This is awful. The graph is a feature of PositiveSites, and not of the
report *parser*. To keep things simple it is best practise to have
functionality where it belongs. It is good OOP design. Your proposal
means the Report class becomes less obvious in what it is. Look how
clean it is now!

What do other people on this list think? I am at a disadvantage here.

I would like this code accepted in Bioruby, so other people can use
it. I disagree with most of above 'criticism'. I certainly balk at the
last non-OOP ones. This is not the first time I am really unhappy. I
can't believe how much trouble I have to go to for a simple class,
which, as it happens, has a perfectly acceptable implementation by
most measures.

Pj.


From jan.aerts at gmail.com  Tue Jan  5 11:53:53 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Tue, 5 Jan 2010 11:53:53 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
Message-ID: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>

All,

It appears that the pre-hackathon bioruby meeting will be very useful :-)
Why don't we use that time to focus on the bit-more-distant future of
bioruby: bioruby 2.0? We could discuss what it should look like without
having to worry about backward compatibility. Topics:
* documentation style (I happen to agree with Naohisa on that)
* class hierarchy: how would we organize the information if we had to start
from scratch? (maybe we should follow bioperl's lead with a Root class?)
* coding style
* general interface decisions
* ...

jan.

PS: Still don't know if I can make it to Japan. Will know this afternoon
(broken foot might interfere...)

2010/1/5 Pjotr Prins <pjotr.public14 at thebird.nl>

> Hi Naohisa,
>
> First I thought you were kidding. But then I realise you are serious.
>
> I don't think we need to document every simple class variable/accessor
> to accept this source code. That is overkill. If you don't understand
> lnL or alpha, don't use it. We are not in the business of documenting
> for documenting's sake.  Documenting lnL and alpha will be like:
>
> "Retrieve the lnL value from the Report"
>
> "Retrieve the alpha value from the Report"
>
> etc. etc. I don't think we should be doing that. Standard 1-to-1
> relations are obvious and don't need lots of text in the code base.
>
> If someone feels like filling in these obvious statements, fine. It
> really goes against my grain. Do we document every single accessor?
> Note the previous implementation did no such thing. That code was
> accepted fine (and partially written by you).
>
> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
> > It is recommended to use the style written in the README_DEV.rdoc, or
> > the style used in the Ruby source code.
>
> You mean the contents of the input buffer, which is the content of the
> input file? I see many places in Bioruby where no such a thing is
> done.  Why become strict on this now? If you want a different
> descriptive name for the variable - that is fine. Propose me
> a better name.
>
> > >      def to_a
> > >        [ @position, @aaref, @probability, @omega ]
> > >      end
> > What is the purpose of the method?
>
> Access converter. Convenience, really. You can remove it if you
> dislike it so much. I use it for testing and to write to a file. Could
> be to_s too, but that fixates the format.
>
> > >    class PositiveSites < Array
> >
> > To inherit Array and to create original container class is discouraged.
> > In BioRuby, we have deprecated Bio::Features and Bio::References in
> > version 1.3.0, although they do not inherit Array but have an array
> > in the object. (The classes still exist only for backward compatibility,
> > in lib/bio/compat/features.rb and references.rb).
>
> A PositiveSites object has all the features of a list (i.e. Array). I
> think inheritance is what it should be. It is an is_a relationship.
> Adding a @list will just add code. Not only for initialization, but
> also for iterators. I only see how we can move backwards from readable
> code. Nor is it good OOP practice. Inheritance is not *always* bad,
> though I agree it is used too quickly (in general).
>
> > In this case, except initialize, only a method named "graph" is added.
> > I think it is good to add the graph method in the Report class and
> > using an Array for storing PositiveSite objects.
>
> This is awful. The graph is a feature of PositiveSites, and not of the
> report *parser*. To keep things simple it is best practise to have
> functionality where it belongs. It is good OOP design. Your proposal
> means the Report class becomes less obvious in what it is. Look how
> clean it is now!
>
> What do other people think on this list. I am at a disadvantage here.
>
> I would like this code accepted in Bioruby, so other people can use
> it. I disagree with most of above 'criticism'. I certainly balk at the
> last non-OOP ones. This is not the first time I am really unhappy. I
> can't believe how much trouble I have to go to for a simple class,
> which, as it happens, has a perfectly acceptable implementation by
> most measures.
>
> Pj.
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>


From pjotr.public14 at thebird.nl  Tue Jan  5 12:39:02 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 13:39:02 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100105123902.GA10823@thebird.nl>

I propose to write an ALN output writer. ALN files show aligned
sequences with additional lines of information (like a match line). I
want to use it to output PAML positive selection sites. This is
the idea:


SEQ1  alignment 1...
SEQ2  alignment 2...
      ...*.:*....***  (match line)
      ...*....*.....  (pos. sel. line)

Do we want such ALN output (I think it is allowed), and can we allow
for the additional output? I have a proposed interface here:
 
  http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb

I notice ClustalW.to_fasta has been obsoleted. But we don't have
to_aln yet, and we need to allow adding match_lines and other
information.

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 13:20:24 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 22:20:24 +0900
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105103212.GA4584@thebird.nl>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
Message-ID: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

On Tue, 5 Jan 2010 11:32:12 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> Hi Naohisa,
> 
> First I thought you were kidding. But then I realise you are serious.
> 
> I don't think we need to document every simple class variable/accessor
> to accept this source code. That is overkill. If you don't understand
> lnL or alpha, don't use it. We are not in the business of documenting
> for documenting's sake.  Documenting lnL and alpha will be like:
> 
> "Retrieve the lnL value from the Report" 
> 
> "Retrieve the alpha value from the Report" 
> 
> etc. etc. I don't think we should be doing that. Standard 1-to-1
> relations are obvious and don't need lots of text in the code base.

Even a single word is OK, e.g. "lnL" or "alpha".
But having no RDoc at all is not allowed.

Ideally, a truly informative description would help people
unfamiliar with Codeml and might encourage them to start using
Codeml with BioRuby. I understand this cannot be achieved easily.
When writing a new class or adding a large amount of code, it is
also fine to start with minimal documentation and improve it
gradually later.
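
As a minimal sketch of what "a single word" of RDoc per accessor
could look like (the class name ExampleReport, its constructor, and
the values are hypothetical and are not the actual
Bio::PAML::Codeml::Report code):

```ruby
# Hypothetical stand-in for a Codeml report; not the real BioRuby class.
class ExampleReport
  # lnL (log likelihood reported by codeml)
  attr_reader :lnL

  # alpha (gamma shape parameter reported by codeml)
  attr_reader :alpha

  # Creates a report from already-parsed values (assumed interface).
  def initialize(lnL, alpha)
    @lnL = lnL
    @alpha = alpha
  end
end

report = ExampleReport.new(-1234.5, 0.5)
puts report.lnL
puts report.alpha
```

The comment directly above each attr_reader is what RDoc picks up, so
one short line per accessor is enough to avoid an undocumented entry.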

> If someone feels like filling in these obvious statements, fine. It
> really goes against my grain. Do we document every single accessor?
> Note the previous implementation did no such thing. That code was
> accepted fine (and partially written by you).

In late 2005, we decided that all methods, attributes, classes,
modules, etc. should be documented using RDoc. Code written
before early 2006 may have no RDoc. I'm gradually adding RDoc to
such code, but this is not finished yet.

> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
> > It is recommended to use the style written in the README_DEV.rdoc, or
> > the style used in the Ruby source code.
> 
> You mean the contents of the input buffer, which is the content of the
> input file? I see many places in Bioruby where no such a thing is
> done.  Why become strict on this now? If you want a different
> descriptive name for the variable - that is fine. Propose me
> a better name.

No need to change the variable name. I mean that I want the
documentation to make clear that it holds the contents of the file
and not a filename. If you think the current description is clear
enough, that is OK.

> > >      def to_a
> > >        [ @position, @aaref, @probability, @omega ]
> > >      end
> > What is the purpose of the method?
> 
> Access converter. Convenience, really. You can remove it if you
> dislike it so much. I use it for testing and to write to a file. Could
> be to_s too, but that fixates the format.

OK if you find it useful.

> > >    class PositiveSites < Array
> > 
> > To inherit Array and to create original container class is discouraged.
> > In BioRuby, we have deprecated Bio::Features and Bio::References in
> > version 1.3.0, although they do not inherit Array but have an array
> > in the object. (The classes still exist only for backward compatibility,
> > in lib/bio/compat/features.rb and references.rb).
> 
> A PositiveSites object has all the features of a list (i.e. Array). I
> think inheritance is what it should be. It is an is_a relationship.
> Adding a @list will just add code. Not only for initialization, but
> also for iterators. I only see how we can move backwards from readable
> code. Nor is it good OOP practice. Inheritance is not *always* bad,
> though I agree it is used too quickly (in general).
> 
> > In this case, except initialize, only a method named "graph" is added.
> > I think it is good to add the graph method in the Report class and
> > using an Array for storing PositiveSite objects.
> 
> This is awful. The graph is a feature of PositiveSites, and not of the
> report *parser*. To keep things simple it is best practise to have
> functionality where it belongs. It is good OOP design. Your proposal
> means the Report class becomes less obvious in what it is. Look how
> clean it is now!

I respect your design if the class is not only a container of
PositiveSite objects but also has methods that do special things
using relations among two or more objects, i.e. something that is
not a simple accumulation of each object's information.
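
To make the two designs under discussion concrete, here is a rough
sketch of both. All names (PositiveSite as a Struct,
PositiveSitesArray, PositiveSitesWrapper) and the behaviour of
"graph" are assumptions for illustration only, not the code in the
proposed patch:

```ruby
require 'forwardable'

# Hypothetical record for one site; fields follow the to_a example above.
PositiveSite = Struct.new(:position, :aaref, :probability, :omega)

# Variant A: inherit from Array, so every Array method is available.
class PositiveSitesArray < Array
  # Returns a line with '*' at each site position (1-based), ' ' elsewhere.
  def graph(length)
    line = ' ' * length
    each { |site| line[site.position - 1] = '*' }
    line
  end
end

# Variant B: hold an Array internally and delegate only what is needed.
class PositiveSitesWrapper
  extend Forwardable
  include Enumerable
  def_delegators :@list, :each, :size, :<<, :[]

  def initialize(sites = [])
    @list = sites.dup
  end

  # Same behaviour as Variant A, built on the delegated each.
  def graph(length)
    line = ' ' * length
    each { |site| line[site.position - 1] = '*' }
    line
  end
end

sites = [PositiveSite.new(2, 'K', 0.98, 3.2),
         PositiveSite.new(5, 'R', 0.96, 2.9)]
puts PositiveSitesArray.new(sites).graph(6)
puts PositiveSitesWrapper.new(sites).graph(6)
```

Variant A needs less code, as Pjotr says. Variant B costs a few
delegation lines but exposes only the methods listed in
def_delegators plus those that Enumerable derives from each.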

> What do other people think on this list. I am at a disadvantage here.
>
> I would like this code accepted in Bioruby, so other people can use
> it. I disagree with most of above 'criticism'. I certainly balk at the
> last non-OOP ones. This is not the first time I am really unhappy. I
> can't believe how much trouble I have to go to for a simple class,
> which, as it happens, has a perfectly acceptable implementation by
> most measures.
> 
> Pj.
> 

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan  5 13:28:28 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 5 Jan 2010 22:28:28 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105123902.GA10823@thebird.nl>
References: <20100105123902.GA10823@thebird.nl>
Message-ID: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>

Hi Pjotr,

There is already Bio::Alignment#output_clustal method.
It is implemented in Bio::Alignment::Output module.

http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 5 Jan 2010 13:39:02 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> I propose to write an ALN output writer. ALN files show aligned
> sequences with additional lines of information (like a match line). I
> want to use it to output PAML positive selection sites. This is
> the idea:
> 
> 
> SEQ1  alignment 1...
> SEQ2  alignment 2...
>       ...*.:*....***  (match line)
>       ...*....*.....  (pos. sel. line)
> 
> Do we want such ALN output (I think it is allowed), and can we allow
> for the additional output. I have a proposed interface here:
>  
>   http://github.com/pjotrp/bioruby/commit/7f320781039b56aee991ab72404655fae210e2cb
> 
> I notice ClustalW.to_fasta has been obsoleted. But we don't have
> to_aln yet, and we need to allow adding match_lines and other
> information.
> 
> Pj.
> 


From pjotr.public14 at thebird.nl  Tue Jan  5 17:04:34 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 18:04:34 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<20100105132025.B532E1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105170434.GB13498@thebird.nl>

Hi Naohisa,

Thanks for clarifying. I am happy now.

Pj.


From pjotr.public14 at thebird.nl  Tue Jan  5 17:09:25 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 5 Jan 2010 18:09:25 +0100
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100105170925.GA13828@thebird.nl>

On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote:
> Hi Pjotr,
> 
> There is already Bio::Alignment#output_clustal method.
> It is implemented in Bio::Alignment::Output module.
> 
> http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092

I missed that. Still it has no functionality for adding the
match_line, nor for adding extra information lines. Can I modify this
to give this method an optional parameter (list of String) for this?

The Alignment class is not aware of 'imported' match lines (it is Clustal
specific in Bioruby at this stage). 

How do you suppose we can do this so I can generate the ALN with
multiple match lines?

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Wed Jan  6 03:31:25 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 6 Jan 2010 12:31:25 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100105170925.GA13828@thebird.nl>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105170925.GA13828@thebird.nl>
Message-ID: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 5 Jan 2010 18:09:25 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Tue, Jan 05, 2010 at 10:28:28PM +0900, Naohisa GOTO wrote:
> > Hi Pjotr,
> > 
> > There is already Bio::Alignment#output_clustal method.
> > It is implemented in Bio::Alignment::Output module.
> > 
> > http://bioruby.org/rdoc/classes/Bio/Alignment/Output.html#M000092
> 
> I missed that. Still it has no functionality for adding the
> match_line, nor for adding extra information lines. Can I modify this
> to give this method an optional parameter (list of String) for this?
>
> The Alignment class is not aware of 'imported' match lines (it is Clustal
> specific in Bioruby at this stage). 

The output_clustal method takes a Hash argument named "options".
The match line can be replaced with any given string through an option:

  alignment.output_clustal(:match_line => str)

I'm very sorry for the incomplete documentation. It was first written
in 2003, and documents were added after 2005 but are still incomplete.

The Bio::Alignment#match_line method calculates the match line
with the same algorithm as ClustalW.

> How do you suppose we can do this so I can generate the ALN with
> multiple match lines?

I'm afraid this would not be regarded as Clustal format.
Of course, it is technically easy to add such a function.

There may be many private extensions of the Clustal format.
I think this is OK because the Clustal format is loosely defined,
although this makes it hard to validate Clustal format.


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From pjotr.public14 at thebird.nl  Wed Jan  6 08:07:10 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 6 Jan 2010 09:07:10 +0100
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>
References: <20100105123902.GA10823@thebird.nl>
	<20100105132829.27F7B1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105170925.GA13828@thebird.nl>
	<20100106033125.754941CBC3D4@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100106080710.GA23141@thebird.nl>

On Wed, Jan 06, 2010 at 12:31:25PM +0900, Naohisa GOTO wrote:
> > How do you suppose we can do this so I can generate the ALN with
> > multiple match lines?
> 
> I'm afraid this would not be regarded as Clustal format.
> Of course, it is technically easy to add such a function.
> 
> There may be many private extensions of the Clustal format.
> I think this is OK because the Clustal format is loosely defined,
> although this makes it hard to validate Clustal format.

Standards are vague. EMBOSS does not even mention the match line, but
as ClustalW generates it we assume it is a 'standard'. I think most
parsers basically ignore lines starting with white space. So multiple
'match lines' should normally work. Many standards in bioinformatics
evolve from use - maybe my idea will become a standard one day ;-).

I think it is a nice feature to have. I'll add a warning that one
should use it with caution.
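
A rough sketch of the idea, assuming a hypothetical helper named
aln_block that is not part of BioRuby. It leaves out the CLUSTAL
header line and line wrapping, and the annotation strings are
hand-written placeholders rather than computed values:

```ruby
# Hypothetical helper, not part of BioRuby: formats named sequences in a
# Clustal-like layout and appends any number of annotation lines.
def aln_block(seqs, annotation_lines = [], name_width = 6)
  out = seqs.map { |name, seq| name.ljust(name_width) + seq }
  # Annotation lines start with white space, just like the match line,
  # so a parser that skips such lines would skip all of them.
  out.concat(annotation_lines.map { |line| ' ' * name_width + line })
  out.join("\n") + "\n"
end

seqs = [['SEQ1', 'MKVLAT'], ['SEQ2', 'MKILSA']]
match_line   = '**:* .'   # placeholder, not computed
pos_sel_line = '...*..'   # placeholder, not computed
print aln_block(seqs, [match_line, pos_sel_line])
```

Passing the extra lines as a plain list of strings keeps the writer
unaware of what each line means, which matches the "optional
parameter (list of String)" suggested earlier in this thread.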

BTW the ALN-writer should really live in its own class/module, similar
to the current layout for the 'Report' class (which in reality is an
ALN parser, or ALN-reader). It is no surprise I did not find either of
them when I was looking for an implementation.

OK, I'll cook something up in a separate git branch.

Pj.


From mail at michaelbarton.me.uk  Wed Jan  6 16:58:01 2010
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Wed, 6 Jan 2010 11:58:01 -0500
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp> 
	<20100104115105.GA21035@thebird.nl> <20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp> 
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
Message-ID: <c27b73c1001060858u4598b22an14ee320a46ec1a0@mail.gmail.com>

2010/1/5 Jan Aerts <jan.aerts at gmail.com>:
> It appears that the pre-hackathon bioruby meeting will be very useful :-)
> Why don't we use that time to focus on the bit-more-distant future of
> bioruby: bioruby 2.0? We could discuss what it should look like without
> having to worry about backward compatibility.

I second what Jan has suggested about the direction of BioRuby and
version 2.0. As Ruby becomes a more popular programming language in
bioinformatics, BioRuby can be expected to receive more and more
contributions. The run-up to BioRuby 2.0 might be a good time to
discuss how BioRuby will grow and be organised as it increases in
size.

Topics:
> * documentation style (I happen to agree with Naohisa on that)
> * class hierarchy: how would we organize the information if we had to start
> from scratch? (maybe we should follow bioperl's lead with a Root class?)
> * coding style
> * general interface decisions
> * ...
>
> jan.
>
> PS: Still don't know if I can make it to Japan. Will know this afternoon
> (broken foot might interfere...)
>
> 2010/1/5 Pjotr Prins <pjotr.public14 at thebird.nl>
>
>> Hi Naohisa,
>>
>> First I thought you were kidding. But then I realise you are serious.
>>
>> I don't think we need to document every simple class variable/accessor
>> to accept this source code. That is overkill. If you don't understand
>> lnL or alpha, don't use it. We are not in the business of documenting
>> for documenting's sake.  Documenting lnL and alpha will be like:
>>
>> "Retrieve the lnL value from the Report"
>>
>> "Retrieve the alpha value from the Report"
>>
>> etc. etc. I don't think we should be doing that. Standard 1-to-1
>> relations are obvious and don't need lots of text in the code base.
>>
>> If someone feels like filling in these obvious statements, fine. It
>> really goes against my grain. Do we document every single accessor?
>> Note the previous implementation did no such thing. That code was
>> accepted fine (and partially written by you).
>>
>> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
>> > It is recommended to use the style written in the README_DEV.rdoc, or
>> > the style used in the Ruby source code.
>>
>> You mean the contents of the input buffer, which is the content of the
>> input file? I see many places in Bioruby where no such a thing is
>> done.  Why become strict on this now? If you want a different
>> descriptive name for the variable - that is fine. Propose me
>> a better name.
>>
>> > >      def to_a
>> > >        [ @position, @aaref, @probability, @omega ]
>> > >      end
>> > What is the purpose of the method?
>>
>> Access converter. Convenience, really. You can remove it if you
>> dislike it so much. I use it for testing and to write to a file. Could
>> be to_s too, but that fixates the format.
>>
>> > >    class PositiveSites < Array
>> >
>> > To inherit Array and to create original container class is discouraged.
>> > In BioRuby, we have deprecated Bio::Features and Bio::References in
>> > version 1.3.0, although they do not inherit Array but have an array
>> > in the object. (The classes still exist only for backward compatibility,
>> > in lib/bio/compat/features.rb and references.rb).
>>
>> A PositiveSites object has all the features of a list (i.e. Array). I
>> think inheritance is what it should be. It is an is_a relationship.
>> Adding a @list will just add code. Not only for initialization, but
>> also for iterators. I only see how we can move backwards from readable
>> code. Nor is it good OOP practice. Inheritance is not *always* bad,
>> though I agree it is used too quickly (in general).
>>
>> > In this case, except initialize, only a method named "graph" is added.
>> > I think it is good to add the graph method in the Report class and
>> > using an Array for storing PositiveSite objects.
>>
>> This is awful. The graph is a feature of PositiveSites, and not of the
>> report *parser*. To keep things simple it is best practise to have
>> functionality where it belongs. It is good OOP design. Your proposal
>> means the Report class becomes less obvious in what it is. Look how
>> clean it is now!
>>
>> What do other people think on this list. I am at a disadvantage here.
>>
>> I would like this code accepted in Bioruby, so other people can use
>> it. I disagree with most of above 'criticism'. I certainly balk at the
>> last non-OOP ones. This is not the first time I am really unhappy. I
>> can't believe how much trouble I have to go to for a simple class,
>> which, as it happens, has a perfectly acceptable implementation by
>> most measures.
>>
>> Pj.
>>


From jan.aerts at gmail.com  Fri Jan  8 16:29:07 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 8 Jan 2010 16:29:07 +0000
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
Message-ID: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>

Maybe it'd be a good idea to start thinking at a level removed from actual
code, and create some general design documents first. Maybe we should
* describe what we actually want to achieve with the bioruby toolkit: should
it be a library foremost, or should it rather be an interface to run other
programs (e.g. BLAST)?
* make a high-level overview of different parts of bioruby:
  - how do we handle file formats: are the files actual objects, or do they
merely describe a biological entity? E.g. does a FASTA file merit the
instantiation of a FASTA object, or is it nothing more than a container of
Sequence objects?
  - how do different parts of the library interact? Should we have a Root
class such as in bioperl? What type of class should be used to interface
with the world (e.g. file parsing)? What type of class should be used to
actually contain the object data (e.g. annotated sequence)?

When that's done: come up with general guidelines for coding, e.g. always
use keyword-based argument lists or something (just an example).

jan.

2010/1/5 Jan Aerts <jan.aerts at gmail.com>

> All,
>
> It appears that the pre-hackathon bioruby meeting will be very useful :-)
> Why don't we use that time to focus on the bit-more-distant future of
> bioruby: bioruby 2.0? We could discuss what it should look like without
> having to worry about backward compatibility. Topics:
> * documentation style (I happen to agree with Naohisa on that)
> * class hierarchy: how would we organize the information if we had to start
> from scratch? (maybe we should follow bioperl's lead with a Root class?)
> * coding style
> * general interface decisions
> * ...
>
> jan.
>
> PS: Still don't know if I can make it to Japan. Will know this afternoon
> (broken foot might interfere...)
>
> 2010/1/5 Pjotr Prins <pjotr.public14 at thebird.nl>
>
> Hi Naohisa,
>>
>> First I thought you were kidding. But then I realise you are serious.
>>
>> I don't think we need to document every simple class variable/accessor
>> to accept this source code. That is overkill. If you don't understand
>> lnL or alpha, don't use it. We are not in the business of documenting
>> for documenting's sake.  Documenting lnL and alpha will be like:
>>
>> "Retrieve the lnL value from the Report"
>>
>> "Retrieve the alpha value from the Report"
>>
>> etc. etc. I don't think we should be doing that. Standard 1-to-1
>> relations are obvious and don't need lots of text in the code base.
>>
>> If someone feels like filling in these obvious statements, fine. It
>> really goes against my grain. Do we document every single accessor?
>> Note the previous implementation did no such thing. That code was
>> accepted fine (and partially written by you).
>>
>> > Details of +buf+ (class, contents, etc) should also be written in RDoc.
>> > It is recommended to use the style written in the README_DEV.rdoc, or
>> > the style used in the Ruby source code.
>>
>> You mean the contents of the input buffer, which is the content of the
>> input file? I see many places in Bioruby where no such a thing is
>> done.  Why become strict on this now? If you want a different
>> descriptive name for the variable - that is fine. Propose me
>> a better name.
>>
>> > >      def to_a
>> > >        [ @position, @aaref, @probability, @omega ]
>> > >      end
>> > What is the purpose of the method?
>>
>> Access converter. Convenience, really. You can remove it if you
>> dislike it so much. I use it for testing and to write to a file. Could
>> be to_s too, but that fixates the format.
>>
>> > >    class PositiveSites < Array
>> >
>> > To inherit Array and to create original container class is discouraged.
>> > In BioRuby, we have deprecated Bio::Features and Bio::References in
>> > version 1.3.0, although they do not inherit Array but have an array
>> > in the object. (The classes still exist only for backward compatibility,
>> > in lib/bio/compat/features.rb and references.rb).
>>
>> A PositiveSites object has all the features of a list (i.e. Array). I
>> think inheritance is what it should be. It is an is_a relationship.
>> Adding a @list will just add code. Not only for initialization, but
>> also for iterators. I only see how we can move backwards from readable
>> code. Nor is it good OOP practice. Inheritance is not *always* bad,
>> though I agree it is used too quickly (in general).
>>
>> > In this case, except initialize, only a method named "graph" is added.
>> > I think it is good to add the graph method in the Report class and
>> > using an Array for storing PositiveSite objects.
>>
>> This is awful. The graph is a feature of PositiveSites, and not of the
>> report *parser*. To keep things simple it is best practise to have
>> functionality where it belongs. It is good OOP design. Your proposal
>> means the Report class becomes less obvious in what it is. Look how
>> clean it is now!
>>
>> What do other people think on this list. I am at a disadvantage here.
>>
>> I would like this code accepted in Bioruby, so other people can use
>> it. I disagree with most of above 'criticism'. I certainly balk at the
>> last non-OOP ones. This is not the first time I am really unhappy. I
>> can't believe how much trouble I have to go to for a simple class,
>> which, as it happens, has a perfectly acceptable implementation by
>> most measures.
>>
>> Pj.
>>


From pjotr.public14 at thebird.nl  Fri Jan  8 17:21:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 8 Jan 2010 18:21:32 +0100
Subject: [BioRuby] Codeml parser
In-Reply-To: <4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
References: <20091231141546.GA5770@thebird.nl>
	<20100104071518.EC6721CBC419@idnmail.gen-info.osaka-u.ac.jp>
	<20100104115105.GA21035@thebird.nl>
	<20100104170904.GA26187@thebird.nl>
	<20100105074249.BAB1D1CBC401@idnmail.gen-info.osaka-u.ac.jp>
	<20100105103212.GA4584@thebird.nl>
	<4c7507a71001050353v4e4654abn79f9cbcaeef262e2@mail.gmail.com>
	<4c7507a71001080829u179e48f9jfc81b0590fd34872@mail.gmail.com>
Message-ID: <20100108172132.GA28895@thebird.nl>

On Fri, Jan 08, 2010 at 04:29:07PM +0000, Jan Aerts wrote:
> Maybe it'd be a good idea to start thinking at a level removed from actual
> code, and create some general design documents first. Maybe we should
> * describe what we actually want to achieve with the bioruby toolkit: should
> it be a library foremost, or should it rather be an interface to run other
> programs (e.g. BLAST)?

I think calling into other programs is a good feature, but it should
really be split out. Likewise for web services. Both should be split
out in terms of objects and directory layout. Currently the
functionality is too intertwined.

Then there is support for reading and writing standard formats.

Then there is extra functionality (not found elsewhere, perhaps).

And we have Rails support and the shell.

All these should be clearly split out.

I don't think we have to choose. We can have it all. Just make sure
it sits in the right location.

> * make a high-level overview of different parts of bioruby:
>   - how do we handle file formats: are the files actual objects, or do they
> merely describe a biological entity? E.g. does a FASTA file merit the
> instantiation of a FASTA object, or is it nothing more than a container of
> Sequence objects?
>   - how do different parts of the library interact? Should we have a Root
> class such as in bioperl? What type of class should be used to interface
> with the world (e.g. file parsing)? What type of class should be used to
> actually contain the object data (e.g. annotated sequence)?
> 
> When that's done: come up with general guidelines for coding, e.g. always
> use keyword-based argument lists or something (just an example).

These choices are design choices and have to originate in a list of
shared 'values'. Because if we don't agree on a value there will
always be arguments and disagreement. One value would be 'clear
documentation', but this may collide with 'clear source code'.
Similarly 'Easy to use code' and 'Concise code' may collide. Or
functional choices over OOP. We need to put those values together and
rank them in importance. Once the ranking is set we can make easy
choices in guidelines.

I am writing a type of Manifest. I'll present that in the coming
weeks, when I feel I am ready. It is meant for discussion in Japan,
and after.

Pj.


From pjotr.public14 at thebird.nl  Mon Jan 11 14:40:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 11 Jan 2010 15:40:41 +0100
Subject: [BioRuby] Clustal ALN writer
Message-ID: <20100111144041.GA31684@thebird.nl>

I have created a colorized HTML alignment file with consensus
information and amino acids showing evidence of positive selection
(based on PAML output).

  http://thebird.nl/projects/test_color2.html

I did a write up on the implementation at:

  http://bioruby.open-bio.org/wiki/BIORUBY_ALNCOLOR

Enjoy,

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan 12 09:29:57 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 12 Jan 2010 18:29:57 +0900
Subject: [BioRuby] Clustal ALN writer
In-Reply-To: <20100111144041.GA31684@thebird.nl>
References: <20100111144041.GA31684@thebird.nl>
Message-ID: <20100112092957.A16001CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I'm not sure whether the prefix Bio::Html is suitable or not.

By the way, I've tried some of your code in
http://github.com/pjotrp/bioruby/blob/color-alignment/
and found potential XSS.

  a = Bio::Alignment.new
  a.add_seq('ATCCATGG', '<script>alert("a");</script>')
  a.add_seq('ATGCATGC', '<script>alert("b");</script>')
  a.add_seq('<script>alert("c");</script>', 'c')
  simple = Bio::Html::HtmlAlignment.new(a,
          :title => '<script>alert("title");</script>')
  html = simple.html()
  File.open('/tmp/xss.html', 'w') { |w| w.print html }

Sequences, sequence names, and consensus lines always need
to be passed through CGI.escapeHTML().

The :title should also be escaped if script users can set it,
although escaping prevents script programmers from using
HTML tags in the title.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



From pjotr.public14 at thebird.nl  Tue Jan 12 10:11:32 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 12 Jan 2010 11:11:32 +0100
Subject: [BioRuby] Bioruby HTML output
Message-ID: <20100112101132.GC10308@thebird.nl>

On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> I'm not sure whether the prefix Bio::Html is suitable or not.

Me neither ;). This is something to discuss when we meet. See my
write up on partitioning based on functionality or standards.

> By the way, I'v tried some of your code in
> http://github.com/pjotrp/bioruby/blob/color-alignment/
> and found potential XSS.
> 
>   a = Bio::Alignment.new
>   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
>   a.add_seq('ATGCATGC', '<script>alert("b");</script>')
>   a.add_seq('<script>alert("c");</script>', 'c')
>   simple = Bio::Html::HtmlAlignment.new(a,
>           :title => '<script>alert("title");</script>')
>   html = simple.html()
>   File.open('/tmp/xss.html', 'w') { |w| w.print html }
> 
> For sequences, sequence names, and consensus lines,
> using CGI.escapeHTML() will always be needed.
>
> For the :title, if script users can set the title, it
> should be escaped, but this prevents script programmers
> using html tags in the title.

Perhaps the HTML generator should escape its output. Personally,
though, I think we should only worry about security when people
*enter* new data on input forms; that is where exploits show up. I
can argue that HTML generation should not concern itself with HOW the
inputs are presented. One advantage of having a programmer set the
'title' is that he *can* embed HTML. Perhaps escaping HTML is the
responsibility of the programmer providing the data, and therefore
belongs to the logic that handles input.

We have had a similar discussion before. We have to decide to what
level *output* code should concern itself with *input* security. I
have a feeling that too many BioRuby classes try to do too much.
How do we stay away from cluttering the code? How do we decide that
callers should not use HTML and handle security concerns?

You write:

>   a.add_seq('ATCCATGG', '<script>alert("a");</script>')

If a programmer wants that - it is his concern in my opinion. If he is
concerned about exploits he should not allow it. The Alignment class
does not care either. It is none of its business.

BTW I fixed a number of PAML::Codeml bugs on this branch. So you
can ignore the existing PAML branch. Let's continue with the color
coding, assuming you can live with the PAML::Codeml implementation,
as it stands.

Pj.


From donttrustben at gmail.com  Tue Jan 12 12:52:42 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 12 Jan 2010 22:52:42 +1000
Subject: [BioRuby] SPTR problem
Message-ID: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>

Hi,

While parsing all the yeast UniProt txt files I came across a problem with
the gn parser - it was returning an array when I expected a hash. Looking at
the code the problem seems to be this when statement:

      when /Name=/,/ORFNames=/
        @data['GN'] = gn_uniprot_parser
      else
        @data['GN'] = gn_old_parser
      end

http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:

GN OrderedLocusNames=YMR084W;

So the GN line had OrderedLocusNames= but not Name= or ORFNames=, so it
didn't use the new parser, unlike the other entries I came across. Should
all 4 possibilities be tested for in the when statement (Synonyms= being
the 4th)?

Also, while I'm here:
* why does the returned hash have different keys than are in the file? e.g.
ORFNames becomes :orfs?
* I also found the parsing process for whole genomes quite slow (multiple
hours for well annotated ones).
* is there any standard way to handle concatenated UniProt files? I wrote my
own as it was simple.

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same
place.


From ngoto at gen-info.osaka-u.ac.jp  Wed Jan 13 02:58:00 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 13 Jan 2010 11:58:00 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100112101132.GC10308@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
Message-ID: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 12 Jan 2010 11:11:32 +0100
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Tue, Jan 12, 2010 at 06:29:57PM +0900, Naohisa GOTO wrote:
> > I'm not sure whether the prefix Bio::Html is suitable or not.
> 
> Me neither ;). This is something to discuss when we meet. See my
> write up on partitioning based on functionality or standards.
> 
> > By the way, I'v tried some of your code in
> > http://github.com/pjotrp/bioruby/blob/color-alignment/
> > and found potential XSS.
> > 
> >   a = Bio::Alignment.new
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> >   a.add_seq('ATGCATGC', '<script>alert("b");</script>')
> >   a.add_seq('<script>alert("c");</script>', 'c')
> >   simple = Bio::Html::HtmlAlignment.new(a,
> >           :title => '<script>alert("title");</script>')
> >   html = simple.html()
> >   File.open('/tmp/xss.html', 'w') { |w| w.print html }
> > 
> > For sequences, sequence names, and consensus lines,
> > using CGI.escapeHTML() will always be needed.
> >
> > For the :title, if script users can set the title, it
> > should be escaped, but this prevents script programmers
> > using html tags in the title.
> 
> Perhaps the HTML generator should escape its output. Though I
> personally think we should only be worried about security concerns
> when people *enter* new data on input forms. That is when exploits
> show up. I can argue that HTML generation should not concern itself
> with HOW the inputs are presented. One advantage of having a
> programmer set the 'title' is that he *can* embed HTML. Perhaps
> escaping HTML is the responsibility of the programmer providing the
> data. And therefore to the logic that handles input.

Even apart from security, sequence names (and sequences) that
contain html special characters may not be correctly displayed.

For example, consider sequences whose names encode three parameters,
a, b, and c:

% cat test.aln
CLUSTAL 2.0.9 multiple sequence alignment


1<a<3_b>5_c<7       FKNVFTVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKWFLQ
a>3_5<b<8_c>11      FKNVMSVIKTEFNSHRVKIDSHFHGIWIAGAPPEGTDVYIKTFLQ
                    ****::*********************************** ***
% irb -r bio
irb> report = Bio::ClustalW::Report.new(File.read('test.aln'))
irb> alignment = report.alignment
irb> simple = Bio::Html::HtmlAlignment.new(alignment, :title => 'a,b,c')
irb> File.open('abc.html', 'w') { |w| w.print simple.html() }

ClustalW 2.0.9 treated the sequence names correctly, but the
generated HTML does not display them as expected.

This problem cannot be solved by escaping the input data.
If the sequence name "1<a<3_b>5_c<7" is escaped to
"1&lt;a&lt;3_b&gt;5_c&lt;7" before calling the method,
text indentation will be broken because the length of the
escaped text no longer matches its displayed width in HTML.
To solve this, the escaping must be done by the output
formatting method while it builds the HTML.
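
The indentation problem described above can be reproduced in a few
lines. This is only a sketch: the 20-column name field and the helper
names are assumptions made for illustration, not code from
Bio::Html::HtmlAlignment.

```ruby
require 'cgi'

NAME_WIDTH = 20 # assumed width of the name column

# Pad using the visible length first, then escape for HTML.
def format_row(name, seq)
  CGI.escapeHTML(name.ljust(NAME_WIDTH)) + CGI.escapeHTML(seq)
end

# Escape first, then pad: the padding counts entity characters
# such as "&lt;", so the displayed name column comes out too narrow.
def format_row_pre_escaped(name, seq)
  CGI.escapeHTML(name).ljust(NAME_WIDTH) + seq
end

# Width of the name column as a browser would show it,
# i.e. after the entities have been decoded again.
def visible_name_width(row, seq)
  CGI.unescapeHTML(row).length - seq.length
end
```

With the name "1<a<3_b>5_c<7", the first helper keeps the name column at
the assumed 20 visible characters, while the second one shrinks it to
the bare length of the name.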

> We have had a similar discussion before. We have to decide to what
> level *output* code should concern itself with *input* security. I
> have a feeling that too much of Bioruby classes try to do too much.
> How do we stay away from cluttering the code? How do we decide that
> callers should not use HTML and handle security concerns?

It is difficult to avoid strings that look like HTML: we want
them treated as plain, unformatted text, but some programs
unexpectedly treat them as HTML, as in the example above.

For security, I'd like to ask security experts.
Are there any on this list?

I think escaping should be done by the formatting layer and
should be turned on by default, because:
* Only the output formatting layer knows how the input data
  is processed.
* In many cases, the data comes from outside, and we cannot
  expect it to be safe.
* Different escaping rules are needed for different output types,
  e.g. SQL, HTML, XML, TeX, GNUPLOT, PostScript, R scripts.
  Escaping in the output methods seems natural, and makes it
  possible to switch output formats without worrying about the
  escaping issues specific to each format.
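
The last point in the list above might look like this in code. It is a
rough sketch: the table and method names are invented for illustration,
and the TeX rule is deliberately incomplete (it handles only a few
special characters).

```ruby
require 'cgi'

# One escaping rule per output format (illustrative, not exhaustive).
ESCAPERS = {
  :html => lambda { |s| CGI.escapeHTML(s) },
  # Incomplete TeX rule: prefix a few special characters with a backslash.
  :tex  => lambda { |s| s.gsub(/([&%$#_{}])/) { "\\#{$1}" } },
  :text => lambda { |s| s }
}

# The output layer picks the rule; callers always pass plain text.
def format_name(name, format)
  escaper = ESCAPERS.fetch(format) do
    raise ArgumentError, "unknown format: #{format}"
  end
  escaper.call(name)
end
```

The caller hands over the same plain string whatever the format, so
switching from :html to :tex needs no change on the input side.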

> You write:
> 
> >   a.add_seq('ATCCATGG', '<script>alert("a");</script>')
> 
> If a programmer wants that - it is his concern in my opion. If he is
> concerned about exploits he should not allow it. The Alignment class
> does not care either. It is none of its business.

The example is an extreme case. For security, please ask experts.
Apart from security, I want ">", "<", "&", etc. to be
displayed correctly. I think the methods that build HTML
should take care of this.

> BTW I fixed a number of PAML::Codeml bugs on this branch. So you
> can ignore the existing PAML branch. Let's continue with the color
> coding, assuming you can live with the PAML::Codeml implementation,
> as it stands.

When do you want the Bio::PAML::Codeml code to be merged into the
blessed bioruby repository?


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Jan 13 06:57:11 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 15:57:11 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>

Hi, Happy New Year!

> For security, I'd like to ask security experts.
> Anyone in this list?

Though I am not an expert: in a Japanese blog post,
http://takagi-hiromitsu.jp/diary/20051227.html
Hiromitsu Takagi explains, from a security point of view, why
escaping should be the default at the output point. It sounds
reasonable to me, though I do not know of an English reference.

In addition,

> * Different escaping rules are needed for different output types,
>   e.g. SQL, html, XML, TeX, GNUPLOT, PostScript, R scripts.
>   Escaping by output methods seems natural, and helps to switch
>   output formats without concerning escaping issues specific
>   to each output format.


This is a good argument.
If a title containing HTML tags is necessary, a non-default API that
accepts HTML-marked-up text rather than plain text should be considered.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan




From pjotr.public14 at thebird.nl  Wed Jan 13 07:37:06 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Jan 2010 08:37:06 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100113073706.GA25611@thebird.nl>

Hi all,

OK, I'll adapt the output generator to escape symbols. And I think
you are right that it belongs in the generator. There are really
three scenarios:

1. Output that never contains symbols (sequence)
2. Output that can contain symbols, but should be escaped
(descriptions, IDs)
3. Output that can contain HTML

In my case I have all three. 

I think with a sequence we can assume the content is a legal string.
Escaping is overkill and (if needed) points to a bigger problem. I
think we should not clutter the code with (1) - or degrade performance
by default.

Case (2) yes!

Case (3), like a title or some text to plug in, we should escape by
default, but add a parameter :html_escape => false for the cases where
the user wants to plug in HTML.
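
A minimal sketch of what case (3) could look like, assuming an options
hash with an :html_escape key that defaults to true. The method
render_title and the <h1> wrapper are hypothetical and are not taken
from the color-alignment branch.

```ruby
require 'cgi'

# Escape by default; the caller opts out explicitly to embed HTML.
def render_title(title, options = {})
  escape = options.fetch(:html_escape, true)
  text = escape ? CGI.escapeHTML(title) : title
  "<h1>#{text}</h1>"
end
```

Called as render_title(t) the title is escaped; called as
render_title(t, :html_escape => false) the markup is passed through
unchanged, on the caller's responsibility.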

OK?

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Jan 13 09:44:01 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 13 Jan 2010 18:44:01 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100113073706.GA25611@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
Message-ID: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>

Hi,

> I think with a sequence we can assume the content is a legal string.
> Escaping is overkill and (if needed) points to a bigger problem. I
> think we should not clutter the code with (1) - or degrade performance
> by default.


If we are talking about Bio::Html::HtmlAlignment,
it is better to escape even sequences and match lines, to make
the class more independent of the implementation of the alignment class.
Note that sim4 uses >>>...>>> in its match line, and a future
intron-aware amino acid alignment program might use
special characters to indicate introns.

If performance is really a problem, the code is in
Bio::Alignment::Output, and the constructor guarantees
that there are no special characters, then the escaping may be skipped.
Escaping everything is the simple default program structure;
removing it is a kind of optimization that takes some programming
effort to guarantee the output is valid without escaping.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan




From pjotr.public14 at thebird.nl  Fri Jan 15 14:00:59 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 15 Jan 2010 15:00:59 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100115140059.GA24948@thebird.nl>

On second thought, escaping is less obvious than I thought. I can
escape all generated HTML, but that leaves no way to customize the
output. Say I want to include an href in a sequence descriptor - which
is a fairly typical requirement - that would be disabled. Likewise if
someone wants to customize the title or footer - or even the
information on the match_line. 

The problem here is that we are defining use - forcing the generated
HTML into a straitjacket by adding business logic.

Are we really telling our users not to use HTML in sequence
descriptors, even if it is tied to one type of output?

I don't like it.

I am going to add a 'master' switch for escaping of HTML. The default
will be with escaping.

Pj.



From ngoto at gen-info.osaka-u.ac.jp  Fri Jan 15 17:19:12 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 16 Jan 2010 02:19:12 +0900
Subject: [BioRuby] SPTR problem
In-Reply-To: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
References: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
Message-ID: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 12 Jan 2010 22:52:42 +1000
Ben Woodcroft <donttrustben at gmail.com> wrote:

> Hi,
> 
> While parsing all the yeast UniProt txt files I came across a problem with
> the gn parser - it was returning an array when I expected a hash. Looking at
> the code the problem seems to be this when statement:
> 
>       when /Name=/,/ORFNames=/
>         @data['GN'] = gn_uniprot_parser
>       else
>         @data['GN'] = gn_old_parser
>       end
> 
> http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:
> 
> GN OrderedLocusNames=YMR084W;
> 
> So GN line had OrderedLocusNames= but not  Name= or ORFNames=, so it didn't
> use the new parser, like the other entries I came across. Should all 4
> possibilities be tested for in the when statement: (Synonyms= being the
> 4th)?

It seems to be a bug. Perhaps there were no (or very few) entries
which only had OrderedLocusNames= when the code was first written
in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d.

The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004. 
The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
| 
| GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
| GN   ORFNames=<name1>[, <name2>...];
| 
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.

You are right that all 4 possibilities should be considered.
"Synonyms" could be left out, because it can only be present together
with "Name", but it is safer to include it.
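
A sketch of such a check, assuming only the four token names quoted
above from the UniProt release notes. The constant and method names are
made up for illustration and are not part of the existing parser.

```ruby
# Any of the four tokens marks the GN line as new-style.
GN_NEW_FORMAT = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

# True when the GN line uses the token format introduced in 2004.
def new_gn_format?(gn_line)
  !!(gn_line =~ GN_NEW_FORMAT)
end
```

In the existing case statement the same idea would amount to listing
all four patterns in the when clause, so that a line such as
"GN OrderedLocusNames=YMR084W;" goes to gn_uniprot_parser.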

> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?

I don't know. I now think that using the same names as in
the original entries would be preferable, too.

> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).

Please use a profiler to find the bottlenecks.
 % ruby -rprofile xxx.rb

> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.

What type of "concatenated" do you mean?
For simple concatenation, for example the original file distributed
from the UniProt FTP site, Bio::FlatFile can be used.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(please gunzip before reading!)

 ff = Bio::FlatFile.open("uniprot_sprot.dat")
 ff.each do |e|
   puts e.entry_id
 end

> 
> Thanks,
> ben

Thank you.

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From tomoakin at kenroku.kanazawa-u.ac.jp  Sat Jan 16 05:36:02 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Sat, 16 Jan 2010 14:36:02 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100115140059.GA24948@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
Message-ID: <4B515042.7020204@kenroku.kanazawa-u.ac.jp>

Hi,
Pjotr Prins wrote:
 > On second thought, escaping is less obvious than I thought. I can
 > escape all generated HTML, but that leaves no way to customize the
 > output. Say I want to include an href in a sequence descriptor - which
 > is a fairly typical requirement - that would be disabled.

I agree with this. Having a link to the original sequence on the name
is usually a good idea.

 > I am going to add a 'master' switch for escaping of HTML. The default
 > will be with escaping.

What do you think about testing whether the object responds to
to_html, calling to_html if it does and otherwise passing it to
escapeHTML? Internally, the object may hold plain text and HTML
text, or plain text plus link information, or just the plain text,
but it takes care of how it is output as an HTML inline element.

If properly implemented, it can generate a link from "gi|112233|..."
within a text and cache the converted result.

The object can also simply pass through the user-supplied HTML.

I think a predictable use case is aligning user-supplied sequences
with sequences obtained from databases. Isn't it better to be able to
treat the user-supplied text as plain text while giving the sequences
from databases proper links? This may not be simple with a master switch.


From pjotr.public14 at thebird.nl  Sat Jan 16 08:30:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 16 Jan 2010 09:30:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <4B515042.7020204@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100116083041.GA2663@thebird.nl>

On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
> > I am going to add a 'master' switch for escaping of HTML. The default
> > will be with escaping.
>
> How do you think to test if the object responds to to_html
> and then call to_html else pass to escapeHTML.

In this case the object to convert to HTML is a String and part of
Bio::Alignment. Later implementations of Bio::Alignment could use a
Bio::Sequence.id (or something Naohisa wrote me).  It would mean we
would have to create a Bio::Sequence::Descriptor object, which would
contain several specialized 'output' generators.

This is a recurrent idea we need to discuss.

I think *all* HTML based stuff should be in its own objects - and its
own tree (I have created bio/output/html for that purpose).

I think it is a bad idea to clutter regular BioRuby code with HTML
specific stuff. Likewise for other outputs, as you pointed out, like
plotting. Output should live in

  bio/lib/output/html
  bio/lib/output/plot
  bio/lib/output/gtk
  bio/lib/output/rails (perhaps)
  (etc)

that way display code never pollutes the simple Bio::Sequence object,
for example. You'll get Bio::Html::Sequence for that - or my
preferred naming Bio::HtmlSequence.

Now if Bio::HtmlSequence could be plugged into Bio::Alignment - the
latter would not care - and we could adapt the HtmlSequence info to
show embedded hrefs. 

That would be the proper way to handle it. No testing of methods
(like to_html), but use the object structure to define what is
supported (and not).
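
To make the idea concrete, here is a rough sketch in which the object
type, rather than a method test, decides how a name is rendered. Both
wrapper classes and the name_cells helper are hypothetical; nothing
like them exists in BioRuby, and the real design would involve
Bio::HtmlSequence as described above.

```ruby
require 'cgi'

# Plain text: always escaped when rendered as HTML.
class PlainName
  def initialize(text)
    @text = text
  end

  def html
    CGI.escapeHTML(@text)
  end
end

# Markup prepared by the programmer: rendered as given.
class HtmlName
  def initialize(markup)
    @markup = markup
  end

  def html
    @markup
  end
end

# The formatter calls #html on every name and never inspects the type.
def name_cells(names)
  names.map { |n| "<td>#{n.html}</td>" }
end
```

The cost of this approach is that every name has to be wrapped in one
of the two classes before it reaches the formatter.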

Until we implement that (get Bio::Alignment to support arbitrary
Sequence objects) I think the master switch is fine. I have updated
my branch. Default behaviour is escaping. If a user (like me) wants
it otherwise, it is allowed.

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Sun Jan 17 05:12:35 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Sun, 17 Jan 2010 14:12:35 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100116083041.GA2663@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
Message-ID: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>

Hi,

On 2010/01/16, at 17:30, Pjotr Prins wrote:

> On Sat, Jan 16, 2010 at 02:36:02PM +0900, Tomoaki NISHIYAMA wrote:
>>> I am going to add a 'master' switch for escaping of HTML. The  
>>> default
>>> will be with escaping.
>>
>> How do you think to test if the object responds to to_html
>> and then call to_html else pass to escapeHTML.
>
> In this case the object to convert to HTML is a String and part of
> Bio::Alignment. Later implementations of Bio::Alignment could use a
> Bio::Sequence.id (or something Naohisa wrote me).  It would mean we
> would have to create a Bio::Sequence::Descriptor object, which would
> contain several specialistic 'output' generators.


For the time being I am not expecting a sophisticated mechanism that
automatically generates proper HTML. I simply want to add a means to
distinguish what should be escaped as a matter of course from what
the user has specifically prepared as HTML.

A user can write:

class HTMLString < String
   def to_html
     self
   end
end

a = Bio::Alignment.new
a.add_seq('ATCCATGG',
   HTMLString.new('<a href="http://example.com/path/to/original/seqinfo"><em>a</em></a>'))
# this is HTML under the responsibility of the programmer

a.add_seq('ATGCATGC', '<b>')
# this is not HTML; the caller need not care about '<' or '>'

simple = Bio::Html::HtmlAlignment.new(a,
   :title => HTMLString.new('A <em>fancy</em> <b>HTML</b> <i>title</i>'))
html = simple.html()

If Bio::Alignment does not force the given object to be a String,
such code should be possible without any change to Bio::Alignment;
only the HtmlAlignment class and the programmer need to know about it.
So HTML-specific code does not need to go into regular BioRuby code.

> That would be the proper way to handle it. No testing of methods
> (like to_html), but use the object structure to define what is
> supported (and not).


I'm not sure what you mean by "use the object structure".
How do you distinguish plain text from HTML text?
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan




From pjotr.public14 at thebird.nl  Sun Jan 17 13:54:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 17 Jan 2010 14:54:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100117135441.GA24341@thebird.nl>

Hi Tomoaki,

Thanks for your responses. I really appreciate it.

On Sun, Jan 17, 2010 at 02:12:35PM +0900, Tomoaki NISHIYAMA wrote:
> A user can write:
>
> class HTMLString < String
>   def to_html
>     self
>   end
> end
>
> a = Bio::Alignment.new
> a.add_seq('ATCCATGG',
>   HTMLString.new('<a href="http://example.com/path/to/original/seqinfo"><em>a</em></a>'))

There is at least one 'problem' with this approach.

This assumes that Bio::Alignment will keep its current implementation.
Currently Bio::Alignment stores a list of descriptions, and a list of
sequences. As Naohisa wrote me two weeks ago, this is before
Bio::Sequence had its own identifier/descriptor. If we redesign
Bio::Alignment there is a large chance we will store Bio::Sequence
instead of two lists (I, for one, would certainly favour that).

The other problem is more about OOP. In your example you first say it
is an HTML object (HTMLString) and then add an HTML-specific method,
'to_html'. It is 'told' twice that it generates HTML. 'to_html' also
implies some kind of transformation. We should opt for a different
method name (generate_html, perhaps, or html):

class HTMLString
  def html
  end
end

The 'responsibility' of the output is with HTMLString. Good. This way an
implementation of Bio::Alignment does not need to know about HTML,
but still can generate the output, at the user's request.

> # this is html under the responsibility of the programmer
>
> a.add_seq('ATGCATGC', '<b>')
> # this is not html; don't care on '<', or '>'
>
> simple = Bio::Html::HtmlAlignment.new(a,
>   :title => HTMLString.new('A <em>fancy</em> <b>HTML</b> <i>title</i>'))
> html = simple.html()
>
> If Bio::Alignment does not force the object given to be String,
> such code should be possible without the change in Bio::Alignment,
> and only the HtmlAlignment class and the programmer needs to know it.
> So, HTML specific code does not need go to regular BioRuby code.

HTMLAlignment should not care how the HTML is generated either. It is
really up to the container holding the sequence, or description, what
the output is.

What I don't like about the proposed approach is that HTMLAlignment
gets an object, needs to check for a 'to_html' or 'html' method (ugly),
and, if it does not exist, needs to escape the information (by calling
the to_s method?). That is a lot of formal checking to do for every
output generated.

>> That would be the proper way to handle it. No testing of methods
>> (like to_html), but use the object structure to define what is
>> supported (and not).
>
> I'm not sure what do you mean by "use the object structure".
> How do you distinguish a plain text and HTML text?

The output is generated by an HTML-aware container. We can agree to
use a single method, 'html'.

Create different types of objects:

  HTMLSequence.html - generates formatted HTML
  ColorHTMLSequence.html - generates formatted color HTML
  EscapedHTMLSequence.html - generates escaped native stuff

And if someone wanted it, he could create:

  Sequence.html  - generates plain text

This would prevent downstream 'checking' of object responsibilities.
We can assume the user knows he is going to use HTMLAlignment and
therefore we can expect him to pass in a known HTML supported
Sequence object.
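
A rough sketch of what I mean, nothing more. Only the shared 'html'
method name is the point here. The constructors, the HTMLAlignment
internals and the example.com URL are assumptions made up for
illustration, not existing BioRuby code.

```ruby
require 'cgi'

# Hypothetical sequence wrapper: renders its id as escaped plain text.
class EscapedHTMLSequence
  def initialize(id)
    @id = id
  end

  def html
    CGI.escapeHTML(@id.to_s)
  end
end

# Hypothetical sequence wrapper: renders its id as a link.
class HTMLSequence
  def initialize(id, url)
    @id  = id
    @url = url
  end

  def html
    %(<a href="#{CGI.escapeHTML(@url)}">#{CGI.escapeHTML(@id.to_s)}</a>)
  end
end

# Hypothetical container: it calls html on every member and never
# checks which kind of object it was given.
class HTMLAlignment
  def initialize(sequences)
    @sequences = sequences
  end

  def html
    @sequences.map { |seq| seq.html }.join("<br/>\n")
  end
end

aln = HTMLAlignment.new([
  HTMLSequence.new('seq1', 'http://example.com/seq1'),
  EscapedHTMLSequence.new('<b>')
])
puts aln.html
```

The container has no respond_to? test at all. Passing in an object
without an 'html' method is simply a NoMethodError for the user to fix.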

The reason to get the responsibility in the right place is to keep
the code as clean as possible. You really don't want downstream
checking of methods.

We can further discuss in Japan. At least it is clear we have several
options.

Pj.




From donttrustben at gmail.com  Tue Jan 19 02:15:30 2010
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 19 Jan 2010 12:15:30 +1000
Subject: [BioRuby] SPTR problem
In-Reply-To: <20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
References: <bb2b67d01001120452t2c9bd748v1992b6d1812d418@mail.gmail.com>
	<20100115171913.4396B1CBC40C@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <bb2b67d01001181815h149b1539x2f9701670cca8ecc@mail.gmail.com>

Hi,

Thanks for the response. My replies are embedded below.

2010/1/16 Naohisa GOTO <ngoto at gen-info.osaka-u.ac.jp>

>
> It seems to be a bug. Perhaps there were no (or very few) entries
> which only had OrderedLocusNames= when the code was first written
> in 2005. The commit Id in git was b5c3342437ed698f215a87ea72c6cabf0575709d.
>

I was figuring that. Also, since no actual exception was thrown, errors
might not have been noticed. I wrote a patch for this that I've been using
internally, but haven't included unit tests.
http://github.com/wwood/bioruby/commit/b2f6cb0b
Happy to write tests, but you seem to rewrite my patches anyway..


>
> The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004.
> The document http://www.uniprot.org/docs/sp_news.htm says:
> | The new format of the GN line is:
> |
> | GN   Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
> | GN   ORFNames=<name1>[, <name2>...];
> |
> | None of the above four tokens are mandatory. But a "Synonyms" token can
> | only be present if there is a "Name" token.
>
> You are right the 4 possibilities should be considered.
> "Synonyms" can be eliminated, but it may be safe to be included.
>
> > Also, while I'm here:
> > * why does the returned hash have different keys than are in the file?
> > e.g. ORFNames becomes :orfs?
>
> I don't know. Now, I think using the same names as described
> in the original entries may be preferred, too.
>

What do you suggest we do about this?
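
One option, as a rough sketch only: keep the token names from the file
as the hash keys and treat all four tokens as optional. The function
name parse_gn and the sample gene names are made up for illustration.
This is not the current Bio::SPTR code, and it ignores multi-gene
entries and evidence tags.

```ruby
# Parse the text of a GN line (without the leading "GN   ") into a
# hash keyed by the token names used in the file itself.
def parse_gn(text)
  result = {}
  text.split(';').each do |field|
    key, value = field.strip.split('=', 2)
    next if key.nil? || key.empty? || value.nil?
    names = value.split(',').map { |name| name.strip }
    # "Name" holds a single value; the other tokens hold lists.
    result[key] = (key == 'Name') ? names.first : names
  end
  result
end

# Made-up sample lines, not taken from a real entry.
full = parse_gn('Name=abcA; Synonyms=abcX, abcY; ' \
                'OrderedLocusNames=XX_0001; ORFNames=orf1, orf2;')
only_locus = parse_gn('OrderedLocusNames=XX_0002;')

p full
p only_locus
```

An entry that only has OrderedLocusNames= then simply yields a hash
with that single key, rather than nothing.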


>
> > * I also found the parsing process for whole genomes quite slow (multiple
> > hours for well annotated ones).
>
> Please use profiler to find bottlenecks.
>  % ruby -rprofile xxx.rb
>

I tried to do something like that but in the end found it easier to pre-grep
the UniProt file, keeping only the lines relevant to me. There were too many
levels of indirection in my code for me to bother tracking it down.


>
> > * is there any standard way to handle concatenated UniProt files? I wrote
> > my own as it was simple.
>
> What type of "concatenated" do you mean?
> For simple concatenation, for example, original file distributed
> from UniProt FTP site, Bio::FlatFile can be used.
>
> ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
> (please gunzip before reading!)
>
>  ff = Bio::FlatFile.open("uniprot_sprot.dat")
>  ff.each do |e|
>   puts e.entry_id
>  end
>

More evidence I'm an idiot. Like I needed any.
Thanks,
ben


From pjotr.public14 at thebird.nl  Tue Jan 19 10:50:56 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 11:50:56 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100117135441.GA24341@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
Message-ID: <20100119105056.GA29525@thebird.nl>

Based on Tomoaki's comments I propose the following:

The requirements are:

  A- input objects that know about HTML should generate that
  B- other input objects get escapeHTML(object.to_s)

For a container/displayer to recognize object A, object A should have
a method to_html:

  class ObjectA 
    def to_html
    end
  end

If to_html does not exist to_s is called - and escaped. The principle
will go into a mixin for the container class.
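
A minimal sketch of such a mixin, assuming CGI.escapeHTML does the
escaping. The names HtmlEscaping, html_for, LinkedId and
DescriptionList are hypothetical and only there to show the rule at
work.

```ruby
require 'cgi'

# Hypothetical mixin holding the rule: use to_html when the object
# has it, otherwise escape the result of to_s.
module HtmlEscaping
  def html_for(obj)
    if obj.respond_to?(:to_html)
      obj.to_html
    else
      CGI.escapeHTML(obj.to_s)
    end
  end
end

# Hypothetical 'object A': it knows how to render itself as HTML.
class LinkedId
  def initialize(id, url)
    @id  = id
    @url = url
  end

  def to_html
    %(<a href="#{CGI.escapeHTML(@url)}">#{CGI.escapeHTML(@id)}</a>)
  end
end

# Hypothetical container/displayer that mixes the rule in.
class DescriptionList
  include HtmlEscaping

  def initialize(items)
    @items = items
  end

  def html
    @items.map { |item| "<li>#{html_for(item)}</li>" }.join("\n")
  end
end

list = DescriptionList.new([
  LinkedId.new('P12345', 'http://example.com/P12345'),
  '<b>',
  12
])
puts list.html
```

The check lives in one place (html_for), so the container classes
themselves stay free of respond_to? tests.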

Everyone OK with this? 

Pj.


From ktym at hgc.jp  Tue Jan 19 12:41:31 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 19 Jan 2010 21:41:31 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119105056.GA29525@thebird.nl>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
Message-ID: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>

Dear Pj and all,

I'm sorry that I could not spare enough time to follow this thread
but I'd like to add some comments.

Firstly, I don't like the method name 'to_html'. We have already
deprecated 'to_fasta', because by Ruby convention the 'to_' prefix is
reserved for conversion to another class (the above two methods just
convert String to String).

Nakao-san and I are now working to improve our TogoWS service
(http://togows.dbcls.jp) by supporting RDF output. I hope to propose
a generalized way to achieve this (hopefully before the BioHackathon
2010, http://hackathon3.dbcls.jp/).

Our current attempt is to have an 'output' method in the Bio::DB class
and each sub-class implements actual 'output_*' methods relevant
to appropriate formats.

# This kind of requirement may also hold for classes other than
# Bio::DB (for example, the Bio::Sequence, Alignment and Newick classes),
# so we may put this interface in a top-level class (Bio::Root?),
# which does not exist for now, though.

In TogoWS, we internally use the BioRuby library, and the URI

http://togows.dbcls.jp/entry/exampledb/1/definition

is sent to the 'definition' method defined in the Bio::ExampleDB class.
Similarly, we can map '.' notation in the following URLs to call output
method using their suffix as a format specifier.

http://togows.dbcls.jp/entry/exampledb/1.rdf
http://togows.dbcls.jp/entry/exampledb/1.fasta

Therefore, these can be mapped to output(:rdf) and output(:fasta) method
calls to the Bio::ExampleDB class, respectively.

All we need to do is to add these methods in every database class
comprehensively.

I think this is simple enough and beautiful.
I'll attach some primitive pseudocode below.
Comments are welcome.

Regards,
Toshiaki Katayama


module Bio
  class DB
    def output(format)
      send("output_#{format.to_s.downcase}")
    end
  end
end

module Bio
  class ExampleDB < DB
    # output sequence of the entry in FASTA format
    def output_fasta
      ">#{@entry_id} #{@definition}\n#{@sequence}\n"
    end

    # output contents of the entry in RDF (N3) format
    def output_rdf
      prefix_subject   = "http://togows.dbcls.jp/entry/exampledb"
      prefix_predicate = "http://togows.dbcls.jp/ontology/exampledb"
      # literal values must be quoted in N3
      "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#definition>\t\"#{@definition}\" .\n" +
      "<#{prefix_subject}/#{@entry_id}>\t<#{prefix_predicate}#sequence>\t\"#{@sequence}\" .\n"
    end

    # output contents of the entry in HTML format
    def output_html
      "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
    end
  end
end

entry = Bio::ExampleDB.new(str)

entry.output(:fasta)
# =>
# >ENTRY_ID DEFINITION
# atgcatgcatgcatgcatgc

entry.output(:rdf)
# =>
# <http://togows.dbcls.jp/entry/exampledb/ENTRY_ID>	<http://togows.dbcls.jp/ontology/exampledb#definition>	"DEFINITION" .
# <http://togows.dbcls.jp/entry/exampledb/ENTRY_ID>	<http://togows.dbcls.jp/ontology/exampledb#sequence>	"atgcatgcatgcatgcatgc" .




From tomoakin at kenroku.kanazawa-u.ac.jp  Tue Jan 19 14:05:17 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 19 Jan 2010 23:05:17 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID: <F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>

Hi,

> Firstly, I don't like to use the method name 'to_html' as we already
> deprecated to use 'to_fasta' because 'to_' is reserved for conversion
> of the class in Ruby's convention (above two methods just convert
> String to String).

I think HTML and String should actually be different classes.
There are to_i and to_f for conversion between subclasses of Numeric,
and nobody objects to them on the grounds that the conversion is
Numeric to Numeric.

The string "<a href=example.com> aaa</a>" rendered in HTML is
"&lt;a href=example.com&gt; aaa&lt;/a&gt;", but the
HTML "<a href=example.com> aaa</a>" rendered in HTML is
"<a href=example.com> aaa</a>".

So the return value of to_html should be of a different class than String.

So, the point is
>     def output_html
>       "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
>     end

how to regulate the different behavior of @entry_id.
If entry_id is by nature plain text, it should be escaped.
On the other hand, sometimes the user may want to use an HTML-aware
object for whatever purpose (color, link, etc.).
When we want to mix these with data supplied
from outside, say user input to a CGI script, that data should usually
be treated as plain text so that it cannot interfere with the HTML.

#!/usr/local/bin/ruby
require 'bio'
require 'cgi'

class Bio::HTMLString < String
   def to_html
     self
   end
end
def Bio::generate_html(object)
   if object.respond_to?(:to_html)
     object.to_html
   else
     string = CGI.escapeHTML(object.to_s) #fall back to escaping
     Bio::HTMLString.new(string)
   end
end

p Bio::generate_html(12)
p Bio::generate_html(Bio::HTMLString.new('<a href=example.com> aaa</a>'))
p Bio::generate_html('<a href=example.com> aaa</a>')
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Tue Jan 19 14:34:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 15:34:22 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
Message-ID: <20100119143422.GA1781@thebird.nl>

On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote:
> All we need to do is to add these methods in every database class
> comprehensively.
> 
> I think this is simple enough and beautiful.
> I'll attach a primitive pseudo code in below.
> Comments are welcome.

I agree with Tomoaki it is too restrictive. What, indeed, if we want
to present the HTML in a different way?

The second comment is that I dislike the way the current files like
sequence.rb and alignment.rb are mushrooming in size. There is much
too much in there, which discourages people from diving in. I believe
code should be readable, and easy to understand/digest.

Sticking in output 'details', like HTML generation, does not help.

I really would like all HTML to be in one sub-tree. Also XML, RDF and
whatnot. When it is 'business' logic it should be in database. When it
is output transformations it is not 'business' logic any longer.

Don't you think the Sequence, or KEGG, object should not care about
HTML? Or RDF, or plotting? Those are separate functionalities. They
share common access patterns - which are part of the DB class.

Finally, why not use method names? What is the added value of 

  output(:html)

over 

  output_html

Pj.


From ktym at hgc.jp  Tue Jan 19 15:33:30 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 00:33:30 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>
References: <20100112101132.GC10308@thebird.nl>
	<20100113025801.42F981CBC415@idnmail.gen-info.osaka-u.ac.jp>
	<3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<F369EC8A-F74A-43CE-A030-B259C5F45B4D@kenroku.kanazawa-u.ac.jp>
Message-ID: <FBDE48DC-0B56-4AFA-BC08-4F198FE48778@hgc.jp>

Nishiyama-san,

I couldn't catch what you are trying to do...
(maybe because I haven't read through the whole thread)

On 2010/01/19, at 23:05, Tomoaki NISHIYAMA wrote:

> Hi,
> 
>> Firstly, I don't like to use the method name 'to_html' as we already
>> deprecated to use 'to_fasta' because 'to_' is reserved for conversion
>> of the class in Ruby's convention (above two methods just convert
>> String to String).
> 
> I think HTML and String should be actually a different class.
> There are to_i and to_f for conversion between subclasses of Numeric,
> yet this isn't denied because the conversion is Numeric to Numeric.
> 
> a string "<a href=example.com> aaa</a>" in HTML is
> "&lt;a href=example.com&gt; aaa&lt;/a&gt;" but
> HTML "<a href=example.com> aaa</a>" in HTML is "<a href=example.com> aaa</a>"
> 
> The return value of to_html should be a different class than String.

If the method is named to_html, it might return an HTML object.

But, from my point of view, an HTML string is still just text,
and escaping it is the responsibility of the programmer,
depending on where the string will be used.


> 
> So, the point is
>>    def output_html
>>      "<h1>#{@entry_id}</h1> ... blah, blah, blah ..."
>>    end
> 
> how to regulate the different behavior of @entry_id.
> If the nature of entry_id is plain text, that should be escaped.
> On the other hand sometimes the user may want to use html aware
> object for whatever purpose (color, link, etc...).
> When we want to mix them with data supplied
> from outside, say user input into CGI, those data shall usually
> be treated as plain text and suppress any interference with html.

I'm talking about a database class, where the content of
@entry_id is a string parsed from a flat file entry of
that database (it does not come from outside).


> 
> #!/usr/local/bin/ruby
> require 'bio'
> require 'cgi'
> 
> class Bio::HTMLString < String
>  def to_html
>    self
>  end
> end
> def Bio::generate_html(object)
>  if object.respond_to?(:to_html)
>    object.to_html
>  else
>    string = CGI.escapeHTML(object.to_s) #fall back to escaping
>    Bio::HTMLString.new(string)
>  end
> end
> 
> p Bio::generate_html(12)
> p Bio::generate_html(Bio::HTMLString.new('<a href=example.com> aaa</a>'))
> p Bio::generate_html('<a href=example.com> aaa</a>')

Why do we need to have this functionality under the Bio namespace?

Toshiaki



From ktym at hgc.jp  Tue Jan 19 16:21:54 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 01:21:54 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119143422.GA1781@thebird.nl>
References: <3451D169-0A9E-4209-8224-FF3094AC876B@kenroku.kanazawa-u.ac.jp>
	<20100113073706.GA25611@thebird.nl>
	<5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
Message-ID: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>

Dear Pj,

On 2010/01/19, at 23:34, Pjotr Prins wrote:

> On Tue, Jan 19, 2010 at 09:41:31PM +0900, Toshiaki Katayama wrote:
>> All we need to do is to add these methods in every database class
>> comprehensively.
>> 
>> I think this is simple enough and beautiful.
>> I'll attach a primitive pseudo code in below.
>> Comments are welcome.
> 
> I agree with Tomoaki it is too restrictive. What, indeed, if we want
> to present the HTML in a different way?

Hmm. Could you provide me some use cases?

Override the output_html method, or, use some template engine to be
more generic.


> 
> The second comment is that I dislike the way the current files like
> sequence.rb and alignment.rb are mushrooming in size. There is much
> too much in there, which discourages people from diving in. I believe
> code should be readable, and easy to understand/digest.

I agree that some files have become too large to learn and/or maintain.
But if we try to change the structure of the current code base,
we need to define clear criteria beforehand.

If we split files into sub-files, people then need to look through
a larger number of files, and it may also slow down loading of
the bioruby library. It is a matter of balance.

In both cases, the lack of a good guide for reading through the bioruby
library might be the essential issue.


> 
> Sticking in output 'details', like HTML generation, does not help.
> 
> I really would like all HTML to be in one sub-tree. Also XML, RDF and
> whatnot. When it is 'business' logic it should be in database. When it
> is output transformations it is not 'business' logic any longer.

I'm not sure about HTML, but FASTA and RDF, for example, are tightly
related to the original database format/contents. So I proposed
having methods that generate the formatted string in each database class.

There can be many ways to design OO class trees, and finding the best
way to represent/abstract things is always a difficult task.

At some point we may do a refactoring to produce BioRuby 2.0.
Before doing that, we can discuss how to arrange all classes/code cleanly.
We may need someone who understands the entire structure/contents of
the current codebase and is willing to design a better one with good sense.


> 
> Don't you think the Sequence, or KEGG, object should not care about
> HTML? Or RDF, or plotting? Those are separate functionalities. They
> share common access patterns - which are part of the DB class.

Again, we can take either approach. My current proposal is the conservative one:
just add these functionalities to each class, as the class knows what is in it
and what is the best way to represent its contents.

If we move formatting/plotting functionality into a separate class,
it might be something like the Bio::FlatFile class, which knows the header
line format of every database entry. Or we may design a better one.

Anyway, I'm listening now. So please don't stick to HTML only;
think of a global design to which we can plan to migrate.


> 
> Finally, why not use method names? What is the added value of 
> 
>  output(:html)
> 
> over 
> 
>  output_html
> 
> Pj.

Maybe from an aesthetic viewpoint?

I think it looks better, and we can easily switch the output format
depending on the context without modifying the code.
I have something like the @media rule in CSS (screen, print etc.) in mind.

if used_for_semantic_web?
  format = :rdf
  # add some codes to do preparation job for SW
elsif used_for_blast?
  format = :fasta
  # add some codes to do preparation job for blast
end

# we don't need to change the following line in any context
entry.output(format)

Toshiaki


From pjotr.public14 at thebird.nl  Tue Jan 19 20:52:41 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 19 Jan 2010 21:52:41 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100119205241.GA7043@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I agree with Tomoaki it is too restrictive. What, indeed, if we want
> > to present the HTML in a different way?
> 
> Hmm. Could you provide me some use cases?

Think of URL's. One user wants to point a gene ID to NCBI. Another
to Swissprot. The container can not be aware of all exceptions - and
really should not handle it.

> Override the output_html method, or, use some template engine to be
> more generic.

Maybe those are good mechanisms. In the pre-hackathon we should
discuss these points.

> I can agree some files became too large to learn and/or maintain.
> But if we try to change the structure of current code base,
> we need to define a clean criteria beforehand.

Yes.

> If we separate files into sub files, people then need to look around
> the number of files, and it may also slow down the loading speed of
> the bioruby library. It is a problem of balance.
> 
> In both cases, lack of excellent guide to read through the bioruby
> library might be a essential issue.

I think if we structure the files and modules well - and make them
small enough - they become self-explaining. That would be my ultimate
goal.

> At some time, we may do refactoring to produce BioRuby 2.0.
> Before doing that, we can discuss how to sit all classes/codes cleanly.
> We may need someone who understand entire structure/contents of
> the current codebase and willing to design a better one with a good sense.

Yes. I agree it is a big step. But we should go for this type of
challenge.

> > Don't you think the Sequence, or KEGG, object should not care about
> > HTML? Or RDF, or plotting? Those are separate functionalities. They
> > share common access patterns - which are part of the DB class.
> 
> Again, we can take both approach. My current proposal is conservative one.
> Just add these functionalities in each class as the class knows what is in it
> and what is the best way to represent the contents.
> 
> If we separate formatting/plotting functionalities into separate class,
> which might be something like Bio::FlatFile class who knows the header
> line format of every database entries. Or we may design better one.

FlatFile has some downsides. It has complicated the libraries.
Complication means the modules are less easy to adapt/modify. I think
it is slightly over-engineered. Maybe not enough of a problem to take
it out, but I hope you see where I am coming from.

> Anyway, I'm now listening. So, please don't stick with HTML things only
> and think a global design to which we can plan to migrate.

I have to spend a day on a writeup. In the coming two weeks. I will
try to explain my ideas.

> Maybe from esthetics viewpoint?
> 
> I think it looks better, and, we can easily switch the output format
> depending on the context without modifying the code.
> Something like a @media property in CSS (screen, print etc.) in mind.
> 
> if used_for_semantic_web?
>   format = :rdf
>   # add some codes to do preparation job for SW
> elsif used_for_blast?
>   format = :fasta
>   # add some codes to do preparation job for blast
> end
> 
> # we don't need to change the following line in any context
> entry.output(format)

I see your point. The criticism is that it obfuscates the real
intention of the code - i.e. it is not self documenting any longer.
But, I guess, this boils down to preferences and acquired tastes. It
is not obvious to a newbie, though it may be obvious for someone who
is accustomed to Bioruby internals. Which may be good - depending on
our basic values.

Pj.


From ktym at hgc.jp  Wed Jan 20 00:49:37 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed, 20 Jan 2010 09:49:37 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100119205241.GA7043@thebird.nl>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
	<20100119205241.GA7043@thebird.nl>
Message-ID: <DB581F4F-C57E-4007-B3B9-6BFB89BC20CE@hgc.jp>

Dear Pj,

On 2010/01/20, at 5:52, Pjotr Prins wrote:

> Dear Toshiaki,
> 
> On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
>>> I agree with Tomoaki it is too restrictive. What, indeed, if we want
>>> to present the HTML in a different way?
>> 
>> Hmm. Could you provide me some use cases?
> 
> Think of URL's. One user wants to point a gene ID to NCBI. Another
> to Swissprot. The container can not be aware of all exceptions - and
> really should not handle it.

Still not clear to me.

I assumed we would generate a URL string for the href attribute of <a>.
However, are there any IDs which need to be escaped?
Or do you mean embedding an HTML snippet in a URL?
If so, we may need to use URL encoding (URI.escape)
instead of HTML escaping (CGI.escapeHTML).
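
To illustrate the difference between the two kinds of escaping, here is a
rough sketch. The gene ID and the example.org URL are made up for
illustration, and CGI.escape is used here as the URL-encoding call:

```ruby
require 'cgi'

# Hypothetical gene ID containing characters that matter in both contexts.
gene_id = 'b0002&db=x y'

# URL-encode the ID when it becomes part of a query string ...
query = CGI.escape(gene_id)
url = "http://example.org/entry?id=#{query}"

# ... and HTML-escape text when it is written into the HTML document.
link = %Q(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(gene_id)}</a>)
puts link
```

The ID is encoded once for the URL, and the result is then escaped again for
the HTML context, so the two steps complement rather than replace each other.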


> 
>> Override the output_html method, or, use some template engine to be
>> more generic.
> 
> Maybe those are good mechanisms. In the pre-hackathon we should
> discuss these points.

Is there any better replacement for Ruby's CGI library available?

Requirements:

- separation of the HTML from CGI

CGI.escapeHTML looks ugly in terms of the naming convention (CamelCase)
and the name space -- why not HTML.escape(string)? Moreover, we don't
want to require 'cgi' just for escaping an HTML string.

- support for templates (separation of logic and presentation)

I have used erb and html-template. Sometimes erb is too slow (especially
when it contains a nested loop to generate a number of lists or tables).

- bundled with Ruby as a standard library

Otherwise, we had better use Rails as the default environment
(from the viewpoint of popularity).


> 
>> I can agree some files became too large to learn and/or maintain.
>> But if we try to change the structure of current code base,
>> we need to define a clean criteria beforehand.
> 
> Yes.
> 
>> If we separate files into sub files, people then need to look around
>> the number of files, and it may also slow down the loading speed of
>> the bioruby library. It is a problem of balance.
>> 
>> In both cases, lack of excellent guide to read through the bioruby
>> library might be a essential issue.
> 
> I think if we structure the files and modules well - and make them
> small enough - they become self-explaining. That would be my ultimate
> goal.
> 
>> At some time, we may do refactoring to produce BioRuby 2.0.
>> Before doing that, we can discuss how to sit all classes/codes cleanly.
>> We may need someone who understand entire structure/contents of
>> the current codebase and willing to design a better one with a good sense.
> 
> Yes. I agree it is a big step. But we should go for this type of
> challenge.
> 
>>> Don't you think the Sequence, or KEGG, object should not care about
>>> HTML? Or RDF, or plotting? Those are separate functionalities. They
>>> share common access patterns - which are part of the DB class.
>> 
>> Again, we can take both approach. My current proposal is conservative one.
>> Just add these functionalities in each class as the class knows what is in it
>> and what is the best way to represent the contents.
>> 
>> If we separate formatting/plotting functionalities into separate class,
>> which might be something like Bio::FlatFile class who knows the header
>> line format of every database entries. Or we may design better one.
> 
> FlatFile has some downsides. It has complicated the libraries.
> Complication means the modules are less easy to adapt/modify. I think
> it is slightly over-engineered. Maybe not enough of a problem to take
> it out, but I hope you see where I am coming from.
> 
>> Anyway, I'm now listening. So, please don't stick with HTML things only
>> and think a global design to which we can plan to migrate.
> 
> I have to spend a day on a writeup. In the coming two weeks. I will
> try to explain my ideas.


OK, let's discuss these topics as well during the pre-hackathon
meeting (7th Feb) in Tokyo with other core developers.


> 
>> Maybe from esthetics viewpoint?
>> 
>> I think it looks better, and, we can easily switch the output format
>> depending on the context without modifying the code.
>> Something like a @media property in CSS (screen, print etc.) in mind.
>> 
>> if used_for_semantic_web?
>>  format = :rdf
>>  # add some codes to do preparation job for SW
>> elsif used_for_blast?
>>  format = :fasta
>>  # add some codes to do preparation job for blast
>> end
>> 
>> # we don't need to change the following line in any context
>> entry.output(format)
> 
> I see your point. The criticism is that it obfuscates the real
> intention of the code - i.e. it is not self documenting any longer.
> But, I guess, this boils down to preferences and acquired tastes. It
> is not obvious to a newbie, though it may be obvious for someone who
> is accustomed to Bioruby internals. Which may be good - depending on
> our basic values.
> 
> Pj.


Note that you can still use the output_html method directly in each
database class. The output(format) method is provided just as an abstract
interface, which will be useful in the above situation, for example.

Therefore, both of the following cases should return the same result, and
you can choose the coding style depending on the situation.

# case 1
format = :rdf
entry.output(format)

# case 2
entry.output_rdf

You can also check entry.respond_to?(:output_rdf) in both cases.
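
As a minimal sketch of how such an abstract interface could dispatch to the
format-specific methods (the Entry class and its output strings below are
hypothetical and are not the actual BioRuby implementation):

```ruby
# Hypothetical entry class; only the dispatch logic is of interest here.
class Entry
  def initialize(id)
    @id = id
  end

  def output_rdf
    "<rdf:Description rdf:about=\"#{@id}\" />"
  end

  def output_fasta
    ">#{@id}"
  end

  # Abstract interface: forward output(:rdf) to output_rdf, and so on.
  def output(format)
    method_name = "output_#{format}"
    unless respond_to?(method_name)
      raise ArgumentError, "unsupported format: #{format.inspect}"
    end
    __send__(method_name)
  end
end

entry = Entry.new('C00001')
format = :rdf
puts entry.output(format) == entry.output_rdf
```

With this arrangement, adding a new format only requires defining another
output_xxx method; the output(format) method itself does not change.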

Toshiaki


From pjotr.public14 at thebird.nl  Wed Jan 20 07:36:44 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 20 Jan 2010 08:36:44 +0100
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
Message-ID: <20100120073644.GA11295@thebird.nl>

Dear Toshiaki,

On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
> > I really would like all HTML to be in one sub-tree. Also XML, RDF and
> > whatnot. When it is 'business' logic it should be in database. When it
> > is output transformations it is not 'business' logic any longer.
> 
> I'm not sure about HTML but FASTA and RDF, for example, are tightly
> related to the original database format/contents. So, I proposed
> to have methods to generate formatted string in each database class.
> 
> There can be many ways to design OO class trees and to find the best
> way to represent/abstract things is always a difficult task.

I wrote a nice alignment HTML output generator. Which also displays PAML
output. Currently it is in bio/output/html/htmlalignment.rb and the
class is named Bio::Html::Alignment. 

For the current Bioruby, where do you want to put that? I don't feel
it should be cluttering alignment.rb. Naohisa has suggested
bio/alignment/output/html/alignment.rb instead. I feel uncomfortable
with this. But it is kinda consistent with above, tightly relating it
to the alignment object.

What do you think of the class name?

The code is in my color-alignment branch, see

  http://github.com/pjotrp/bioruby/tree/color-alignment

Is anyone else interested in this type of discussion? We can take it
off-list.

Pj.


From missy at be.to  Wed Jan 20 09:17:50 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Wed, 20 Jan 2010 18:17:50 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format (Bio::FastaNumericFormat)
Message-ID: <4B56CA3E.8000905@be.to>

Hi all,

I am using BioRuby 1.4.0 and have trouble handling the FASTA.QUAL
format using Bio::FastaNumericFormat.

Please see the following code:
========================
require 'rubygems'
require 'bio'

FASTA_QUAL =<<'EOS'
>SAMPLE1
30 30 29 42
EOS

qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
bs = qual.to_biosequence
puts bs.output(:raw)
=========================

The last line raises an error:

=========================
(eval):2:in `__get__seq': undefined method `seq' for #<Bio::FastaNumericFormat:0x2b182810ceb0> (NoMethodError)
         from (eval):4:in `seq'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in `output'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in `output'
         from /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in `output'
         from fasta_numeric_format.rb:11
=========================

In the last line, using :fasta, :fasta_numeric etc. gives the same result.

Please let me know if you have ideas to solve this problem.

Hiro.
-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences


From andrew.j.grimm at gmail.com  Wed Jan 20 12:09:19 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 20 Jan 2010 23:09:19 +1100
Subject: [BioRuby] Thread-safety of alignment
Message-ID: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>

Is alignment intended to be thread-safe in bioruby? If so, should I
use the same alignment factory between threads, or a separate one in
each thread?

Andrew


From ngoto at gen-info.osaka-u.ac.jp  Wed Jan 20 13:36:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:36:29 +0900
Subject: [BioRuby] trouble on the FASTA.QUAL format
 (Bio::FastaNumericFormat)
In-Reply-To: <4B56CA3E.8000905@be.to>
References: <4B56CA3E.8000905@be.to>
Message-ID: <20100120133630.052BF1CBC433@idnmail.gen-info.osaka-u.ac.jp>

Hi,

This is a bug, and it will be fixed.
Indeed, Bio::FastaNumericFormat does not contain a sequence,
and I forgot to take care of the case where to_biosequence is called.

As a workaround,

  qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
  bs = Bio::Sequence.new('')
  bs.quality_scores = qual.data
  puts bs.output(:fasta_numeric)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 20 Jan 2010 18:17:50 +0900
"MISHIMA, Hiroyuki" <missy at be.to> wrote:

> Hi all,
> 
> I am using BioRuby 1.4.0., and have a trouble in handling the FASTA.QUAL
> format using Bio::FastaNumericFormat.
> 
> Please see the following code:
> ========================
> require 'rubygems'
> require 'bio'
> 
> FASTA_QUAL =<<'EOS'
> >SAMPLE1
> 30 30 29 42
> EOS
> 
> qual = Bio::FastaNumericFormat.new(FASTA_QUAL)
> bs = qual.to_biosequence
> puts bs.output(:raw)
> =========================
> 
> The last line raise an error:
> 
> =========================
> (eval):2:in `__get__seq': undefined method `seq' for 
> #<Bio::FastaNumericFormat:0x2b182810ceb0> (NoMethodError)
>          from (eval):4:in `seq'
>          from 
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format_raw.rb:19:in 
> `output'
>          from 
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:97:in 
> `output'
>          from 
> /home/misshie/.gem/ruby/1.8/gems/bio-1.4.0/lib/bio/sequence/format.rb:172:in 
> `output'
>          from fasta_numeric_format.rb:11
> =========================
> 
> In the last line, using :fasta, :fasta_numeric etc. make same results.
> 
> Please let me know if you have ideas to solve this problem.
> 
> Hiro.
> -- 
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


From ngoto at gen-info.osaka-u.ac.jp  Wed Jan 20 13:50:45 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 20 Jan 2010 22:50:45 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
Message-ID: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Wed, 20 Jan 2010 23:09:19 +1100
Andrew Grimm <andrew.j.grimm at gmail.com> wrote:

> Is alignment intended to be thread-safe in bioruby? If so, should I
> use the same alignment factory between threads, or a separate one in
> each thread?

It is not confirmed to be thread-safe, so it is safer to use a
separate one in each thread.

Currently, in BioRuby, manipulating the same object from different
threads is not an intended use. When manipulating the same object from
different threads is needed, using a mutex is recommended.

For library developers, it is encouraged to write thread-safe
code if possible, but not mandatory.
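
A minimal sketch of the mutex approach, using only Ruby's standard Mutex.
The shared array below is just a stand-in for any object that is touched
from several threads; it is not a BioRuby class:

```ruby
require 'thread'

# Stand-in for any shared, non-thread-safe object.
shared = []
lock = Mutex.new

threads = (1..4).map do |i|
  Thread.new do
    100.times do |j|
      # Only one thread at a time may touch the shared object.
      lock.synchronize { shared << [i, j] }
    end
  end
end
threads.each(&:join)

puts shared.size
```

When each thread can have its own object instead, no locking is needed,
which is why a separate one per thread is the simpler choice.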

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> 
> Andrew


From ktym at hgc.jp  Thu Jan 21 14:05:42 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Thu, 21 Jan 2010 23:05:42 +0900
Subject: [BioRuby] Bioruby HTML output
In-Reply-To: <20100120073644.GA11295@thebird.nl>
References: <5A89DD24-3A2E-4087-A7A5-B99FEDAB9070@kenroku.kanazawa-u.ac.jp>
	<20100115140059.GA24948@thebird.nl>
	<4B515042.7020204@kenroku.kanazawa-u.ac.jp>
	<20100116083041.GA2663@thebird.nl>
	<63924659-337A-4694-B207-A4A2BC887C8A@kenroku.kanazawa-u.ac.jp>
	<20100117135441.GA24341@thebird.nl>
	<20100119105056.GA29525@thebird.nl>
	<0783863C-E6FC-4E9D-8C77-9C2FE88CA9B8@hgc.jp>
	<20100119143422.GA1781@thebird.nl>
	<14D9E166-A983-4E42-A164-2C0BE4EF6535@hgc.jp>
	<20100120073644.GA11295@thebird.nl>
Message-ID: <7B739736-1D0D-43E2-89E8-8F6B4DCC3404@hgc.jp>

Dear Pj,

I looked at your code and had a feeling that we should use some template
system. If HTML tags are hard-coded in the library as you did, it will be
very hard for the user to modify them.

Besides, what version of the HTML specification did you have in mind?
This is the first time I have seen the <p> tag used in the form of <p />.
Is it valid?
I also think decorations should be separated into the CSS layer, and you
should avoid using the <font> tag, especially when you are trying to
distribute your code as part of the library.


As for the file location, I still like the way Naohisa has suggested.
However, I'm not sure the intermediate 'output/html' node is necessary
in 'bio/alignment/output/html/alignment.rb'.
Anyway, we need to try every approach to learn the pros and cons.

With your proposal, we may have a tree like this:

--------------------------------------------------
for bio/alignment.rb and bio/db/kegg/compound.rb and bio/db/genbank.rb ...

bio/output/html/html_alignment.rb (Bio::Html::Alignment)
bio/output/html/html_kegg_compound.rb (Bio::Html::KEGG::COMPOUND)
bio/output/html/html_genbank.rb  (Bio::Html::GenBank)
 :

bio/output/rdf/rdf_kegg_compound.rb (Bio::RDF::KEGG::COMPOUND)
bio/output/rdf/rdf_genbank.rb (Bio::RDF::GenBank)
 :

bio/output/fasta/fasta_genbank.rb (Bio::FASTA::GenBank)
bio/output/fasta/fasta_kegg_genes.rb (Bio::FASTA::KEGG::GENES)
 :

bio/output/gff/gff_genbank.rb (Bio::GFF::GenBank)
 :
--------------------------------------------------

Apparently, the class names for output formats conflict with existing
classes (e.g. Bio::FASTA, Bio::GFF), and we would need to look into each
subdirectory to find which output formats are supported for a particular
database.


If we instead gather the templates of output formats alongside the
database classes, we may have a tree like this:

--------------------------------------------------
for bio/alignment.rb:
bio/alignment/alignment.html.erb
 :

for bio/db/kegg/compound.rb:
bio/db/kegg/compound/compound.rdf.erb
bio/db/kegg/compound/compound.tut.erb
bio/db/kegg/compound/compound.html.erb
 :

for bio/db/genbank.rb:
bio/db/genbank/genbank.rdf.erb
bio/db/genbank/genbank.gff.erb
bio/db/genbank/genbank.html.erb
bio/db/genbank/genbank.fasta.erb
 :
--------------------------------------------------

However, this is still only a plan on paper and we need to try more (we
have already started for RDF).
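
As a rough sketch of how such a template could be rendered with erb from
the standard library (the template text, the Compound struct and the render
method are all hypothetical, for illustration only):

```ruby
require 'erb'

# Hypothetical template text; in the layout above it would live in a file
# such as bio/db/kegg/compound/compound.html.erb.
TEMPLATE = <<'EOS'
<div class="compound">
  <h1><%= ERB::Util.html_escape(entry.name) %></h1>
  <p>Formula: <%= ERB::Util.html_escape(entry.formula) %></p>
</div>
EOS

# Hypothetical stand-in for a database entry object.
Compound = Struct.new(:name, :formula)

def render(template, entry)
  # The template sees the local variable `entry` through the binding.
  ERB.new(template).result(binding)
end

html = render(TEMPLATE, Compound.new('Water', 'H2O'))
puts html
```

A user who wants different markup would then edit or replace the template
file rather than the library code.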

Toshiaki


On 2010/01/20, at 16:36, Pjotr Prins wrote:

> Dear Toshiaki,
> 
> On Wed, Jan 20, 2010 at 01:21:54AM +0900, Toshiaki Katayama wrote:
>>> I really would like all HTML to be in one sub-tree. Also XML, RDF and
>>> whatnot. When it is 'business' logic it should be in database. When it
>>> is output transformations it is not 'business' logic any longer.
>> 
>> I'm not sure about HTML but FASTA and RDF, for example, are tightly
>> related to the original database format/contents. So, I proposed
>> to have methods to generate formatted string in each database class.
>> 
>> There can be many ways to design OO class trees and to find the best
>> way to represent/abstract things is always a difficult task.
> 
> I wrote a nice alignment HTML output generator. Which also displays PAML
> output. Currently it is in bio/output/html/htmlalignment.rb and the
> class is named Bio::Html::Alignment. 
> 
> For the current Bioruby, where do you want to put that? I don't feel
> it should be cluttering alignment.rb. Naohisa has suggested
> bio/alignment/output/html/alignment.rb instead. I feel uncomfortable
> with this. But it is kinda consistent with above, tightly relating it
> to the alignment object.
> 
> What do you think of the class name?
> 
> The code is in my color-alignment branch, see
> 
>  http://github.com/pjotrp/bioruby/tree/color-alignment
> 
> Is anyone else interested in this type of discussion? We can take it
> off-list.
> 
> Pj.


From pjotr.public14 at thebird.nl  Thu Jan 21 16:20:49 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 21 Jan 2010 17:20:49 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
Message-ID: <20100121162049.GB31462@thebird.nl>

Dear Toshiaki,

On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote:
> I looked your code and had a feeling that we should use some
> template system.  If HTML tags are hard coded in the library as you
> did, it will be very hard to modify them by the user.

Aren't we trying to overcomplicate things? This is an HTML generator
- in fact it is embedded HTML as I don't provide the <html>, header or
body parts. It can just be inserted into Rails, or whatever HTML
framework that is out there.

Templating is just another abstraction. I don't intend to template
engines like Rails.

Or, are you here merely referring to using the CGI class (or something
like that).  I guess I could do that, though I have trouble seeing the
benefits. It is just another way of writing HTML statements.

> Besides, what version of the HTML specification did you have in
> mind?
> This is my first time to see the <p> tag is used in the form of <p />. Is it valid?

Yes. It is, in fact, XHTML.

> I also think decorations should be separated to the CSS layer and you should avoid to use the <font> tag, especially when you are trying to distribute your code as a part of the library.

We use hard coded colors. I could use CSS, but then you need to
provide a CSS file (or I need to hard code the header of the file).
That makes it (again) more complicated than necessary. Where do we
store the CSS file, how do we make sure the browser finds it? CSS is
really to adapt look and feel. If the output is meant to be fixed, why
make it flexible?  Besides all (future) browsers support the font tag,
as used. If that stops we could always adapt that source code.

> As for the file location, I still like the way Naohisa has
> suggested.

Alright. I can move the files, if that was all.

However, my colored alignment is not going to make it into Bioruby
this way. There is always something wrong with my code, it appears.
Now I need to move file locations that have not really been decided
on; I need to template HTML - but we haven't decided how and it is
questionable; I need to use CSS, though I think it makes things worse
for users.

Are we really sure you want to reject this code just because it does
not live up to everyone's current and future expectations? It may
still be useful to someone else, you know, it does not break anything
else, and can be improved in the future. Once we decide what we want
to achieve.

The same really holds for my PAML branch and my GEO branch. Both
contain useful utilities for others to use. And now the alignment is
the third pending Bioruby branch.

Can you imagine my growing frustration? Should this go into Bioruby,
or should I start another project, like others have done? Or stick it
into my existing biotools or bigbio projects? Just, so I don't have
the hassle?

The way the Perl people handle it is by having independent modules.
Everyone owns his, or her, own module and Perl's CPAN acts more as an
aggregator. The advantage is that the environment is more dynamic. And
you really don't care what is inside a module. That is up to the
maintainer and his/her users.

We could create independent BioRuby modules, which have their own git
repositories. When a module is nice enough to include in Bioruby make
it a git submodule - I use this technique for biolib - it will
register in the BioRuby repository. That way Bioruby still controls
what goes in a release. However, modules can be maintained for
experimental setups or private use. So my modules would go in

  lib/bio/modules/paml
  lib/bio/modules/geo
  lib/bio/modules/htmlalignment

each its own git repository.

When one of those is 'strong' enough for main line you move it into a
different location in the main repository. Modules could even be
included in Bioruby releases.

What hurts me now is that no one is going to use my code, since I
don't have the time to make it perfect, and it is hidden in my
experimental Bioruby branches. We should find a way to make
'experimental code' available to the rest of the community. That way
we may also 'recruit' help to make the code more perfect. 

Make it easy to allow external modules to become visible through
Bioruby - that is a win-win, as well as a more bazaar-like approach
to OSS development.

I wonder how many people on this list would contribute code if it was
more loosely organised.

Pj.


From ktym at hgc.jp  Thu Jan 21 17:54:24 2010
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Fri, 22 Jan 2010 02:54:24 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100121162049.GB31462@thebird.nl>
References: <20100121162049.GB31462@thebird.nl>
Message-ID: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>

Dear Pj,

I can understand your frustration and I like your idea of the
'module' system, as it reminds me of the way the Linux kernel
tree is successfully maintained.

> I wonder how many people on this list would contribute code if it was
> more loosely organised.

Indeed.

However, I think our move from cvs to git was already a great step
that opened a large opportunity to all those who want to participate
in development. Before that, an "open source" project did not always
mean an "open to join" project.

Now, everyone can easily fork the project and release their modified
code, as you have already done. So, we may be able to evaluate from the
current situation how many other people have tried.

Anyway, it is still a difficult problem who will decide, and
how to decide, when to migrate contributed code into the main tree.
It might sound like an excuse, but I'm also suffering from this difficulty.
I also have several modules which are not yet contributed to the main tree,
for example my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/sge/),
because I'm not sure it is general enough or where it fits.


As for the HTML portion, I see your point.

* I'd like to hear comments from others.
* How would people like to render/visualize BioRuby objects (especially in HTML)?
* I didn't mean to use the CGI class for HTML generation (I don't even like that).
* The use of <p /> seems invalid in XHTML. See http://www.w3.org/TR/xhtml1/#C_3


P.S.
I once developed a mechanism to integrate end-user code snippets
into the BioRuby shell, called plugins. I wrote some plugins which render
a colored codon table, a formatted summary of sequence properties etc.

If those, and the functions defined in your plugins, could be easily
accessed by

  puts Bio.your_function_name(options)

or something like that, would that satisfy your needs?

If so, we can consider making a repository for such plugins and bundling
them with BioRuby as well.

Regards,
Toshiaki Katayama


On 2010/01/22, at 1:20, Pjotr Prins wrote:

> Dear Toshiaki,
> 
> On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote:
>> I looked your code and had a feeling that we should use some
>> template system.  If HTML tags are hard coded in the library as you
>> did, it will be very hard to modify them by the user.
> 
> Aren't we trying to overcomplicate things? This is an HTML generator
> - in fact it is embedded HTML as I don't provide the <html>, header or
> body parts. It can just be inserted into Rails, or whatever HTML
> framework that is out there.
> 
> Templating is just another abstraction. I don't intend to template
> engines like Rails.
> 
> Or, are you here merely referring to using the CGI class (or something
> like that).  I guess I could do that, though I have trouble seeing the
> benefits. It is just another way of writing HTML statements.
> 
>> Besides, what version of the HTML specification did you have in
>> mind?
>> This is my first time to see the <p> tag is used in the form of <p />. Is it valid?
> 
> Yes. It is, in fact, XHTML.
> 
>> I also think decorations should be separated to the CSS layer and you should avoid to use the <font> tag, especially when you are trying to distribute your code as a part of the library.
> 
> We use hard coded colors. I could use CSS, but then you need to
> provide a CSS file (or I need to hard code the header of the file).
> That makes it (again) more complicated than necessary. Where do we
> store the CSS file, how do we make sure the browser finds it? CSS is
> really to adapt look and feel. If the output is meant to be fixed, why
> make it flexible?  Besides all (future) browsers support the font tag,
> as used. If that stops we could always adapt that source code.
> 
>> As for the file location, I still like the way Naohisa has
>> suggested.
> 
> Alright. I can move the files, if that was all.
> 
> However, my colored alignment is not going to make it into Bioruby
> this way. There is always something wrong with my code, it appears.
> Now I need to move file locations that have not really been decided
> on; I need to template HTML - but we haven't decided how and it is
> questionable; I need to use CSS, though I think it makes things worse
> for users.
> 
> Are we really sure you want to reject this code just because it does
> not live up to everyone's current and future expectations? It may
> still be useful to someone else, you know, it does not break anything
> else, and can be improved in the future. Once we decide what we want
> to achieve.
> 
> The same really holds for my PAML branch and my GEO branch. Both
> contain useful utilities for others to use. And now the alignment is
> the third pending Bioruby branch.
> 
> Can you imagine my growing frustration? Should this go into Bioruby,
> or should I start another project, like others have done? Or stick it
> into my existing biotools or bigbio projects? Just, so I don't have
> the hassle?
> 
> The way the Perl people handle it is by having independent modules.
> Everyone owns his, or her, own module and Perl's CPAN acts more as an
> aggregator. The advantage is that the environment is more dynamic. And
> you really don't care what is inside a module. That is up to the
> maintainer and his/her users.
> 
> We could create independent BioRuby modules, which have their own git
> repositories. When a module is nice enough to include in Bioruby make
> it a git submodule - I use this technique for biolib - it will
> register in the BioRuby repository. That way Bioruby still controls
> what goes in a release. However, modules can be maintained for
> experimental setups or private use. So my modules would go in
> 
>  lib/bio/modules/paml
>  lib/bio/modules/geo
>  lib/bio/modules/htmlalignment
> 
> each its own git repository.
> 
> When one of those is 'strong' enough for main line you move it into a
> different location in the main repository. Modules could even be
> included in Bioruby releases.
> 
> What hurts me now is that no one is going to use my code, since I
> don't have the time to make it perfect, and it is hidden in my
> experimental Bioruby branches. We should find a way to make
> 'experimental code' available to the rest of the community. That way
> we may also 'recruit' help to make the code more perfect. 
> 
> Make it easy to allow external modules to become visible through
> Bioruby - that is a win-win, as well as a more bazaar-like approach
> to OSS development.
> 
> I wonder how many people on this list would contribute code if it was
> more loosely organised.
> 
> Pj.


From yannick.wurm at unil.ch  Thu Jan 21 18:21:40 2010
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 21 Jan 2010 19:21:40 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
References: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
Message-ID: <EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>

On 21 Jan 2010, at 18:00, bioruby-request at lists.open-bio.org wrote:

> Are we really sure you want to reject this code just because it does
> not live up to everyone's current and future expectations? It may
> still be useful to someone else, you know, it does not break anything
> else, and can be improved in the future. Once we decide what we want
> to achieve.

> 
> What hurts me now is that no one is going to use my code, since I
> don't have the time to make it perfect, and it is hidden in my
> experimental Bioruby branches. We should find a way to make
> 'experimental code' available to the rest of the community. That way
> we may also 'recruit' help to make the code more perfect. 


I agree 100% that enthusiastic bioruby improvements like Pjotr's should be encouraged & given maximal visibility.
It's better to have great tools with room for improvement than no tools. 
(a year or two ago I needed colored html alignments and ended up with an ugly, ugly hack that used t_coffee to generate html output from the alignments I'd generated elsewhere - something like Pjotr's code would have been much more elegant)

I also have the feeling that code contributions in general are given more negative than positive feedback on this list. I believe it's a grave mistake because the bioruby community will not grow without passionate users & contributors and more quality code.

just my two cents,

yannick

--------------------------------------------
          yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
   http://www.unil.ch/dee/page28685_fr.html


From pjotr.public14 at thebird.nl  Fri Jan 22 08:55:08 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 09:55:08 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
Message-ID: <20100122085508.GB12248@thebird.nl>

On Fri, Jan 22, 2010 at 02:54:24AM +0900, Toshiaki Katayama wrote:
> Dear Pj,
> 
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.

Thinking about it there are other good examples. The R language
supports modules in CRAN - similar in many ways to generic Perl CPAN
and Ruby's gems. But, on top of CRAN they also have Bioconductor which
aggregates Bio related modules. The main benefit is that it
pre-packages all Bio related packages and people can load it on the
fly. See http://www.bioconductor.org/

We don't want to replace gems - but I think the gem system is too
loose for most people, and it requires every module to understand and
comply with the gem system.

I think Bioruby can play a role here. We can have modules (or
plugins, like Rails has) that come either with Bioruby's
installation, or get installed on request. If we find a syntax for
that it would be great. E.g.

  Bio::Module.load(:html_alignment)

If it is part of Bioruby, pass. Otherwise throw error: 

"Bio::Module :html_alignment not installed, try Bio::Module.install(:html_alignment)"

  Bio::Module.install(:html_alignment)

will search for the definition and install it. Depending on the module it
can be installed as a gem, or fetched through git or a tarball (an
optional parameter can override the behaviour). On success, either call
leaves the module ready for use:

  html_aln = Bio::Html::Alignment.new('my.aln')

The nice thing about this setup is that

(1) It is really easy on the user

(2) Decouples the module from Bioruby - all issues are between the
users and the module maintainer - discussions can still be on the
main mailing list

(3) Retains some control on what modules are allowed in, and what not

(4) Modules can be obsoleted

(5) Modules can be updated outside Bioruby's mainline. e.g. Bio::Module.install(:html_alignment,:development=>true)
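
A minimal sketch of the load side of this idea, assuming a hard-coded registry. Nothing here exists in BioRuby: Bio::Module, REGISTRY, NotInstalledError and the feature path are all hypothetical names used only for illustration.

```ruby
# Hypothetical sketch only; BioRuby does not ship a Bio::Module API.
module Bio
  module Module
    class NotInstalledError < StandardError; end

    # Known modules: name => feature path handed to require (illustrative).
    REGISTRY = {
      :html_alignment => 'bio/modules/htmlalignment'
    }

    # Load a registered module, or raise with a hint on how to install it.
    def self.load(name)
      feature = REGISTRY[name]
      raise ArgumentError, "unknown Bio::Module #{name.inspect}" unless feature
      begin
        require feature
      rescue LoadError
        raise NotInstalledError,
              "Bio::Module #{name.inspect} not installed, " \
              "try Bio::Module.install(#{name.inspect})"
      end
      true
    end
  end
end
```

An install method would sit next to load and dispatch on the kind of source (gem, git or tarball); it is left out here because that part depends on decisions not yet made.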

Pj.


From tomoakin at kenroku.kanazawa-u.ac.jp  Fri Jan 22 09:12:29 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:12:29 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
Message-ID: <066BB141-7217-4343-85B4-165072A58E06@kenroku.kanazawa-u.ac.jp>

Hi,

> As for the HTML portion, I see your point.
>
> * I'd like to hear comments from others.
> * How people like to render/visualize the BioRuby objects
>   (especially in HTML)?
> * I didn't mean to use the CGI class for HTML generation (I even
>   don't like that).


Perhaps the way to render the objects depends on both the objects and
the purposes, but if an object has a string representation, just showing
it is perhaps a good default. Also, comprehensively defining how to
represent every class in HTML or any other format is too laborious as
the first step, and a way to allow gradual growth of the codebase seems
good. It is the way the flatfile parser grew to support many formats.

Thus, a mechanism that does class-specific conversion, with a default
conversion for non-HTML-aware classes, is good.
Criticism of using the 'cgi' library for the default conversion,
CGI.escapeHTML(object.to_s), especially of the name,
is understandable.
There is already criticism of CGI.rb itself
<http://jp.rubyist.net/magazine/?0023-Cgirb>
but there are no *standard* alternatives yet.
Perhaps we can just copy or rewrite the escapeHTML code
and give it any name that fits our purpose.

A drawback of having our own escapeHTML code is that it could be
redundant in the many cases where the HTML generation is for CGI, and
we cannot benefit from CGIAlt or any other compatible speedup
library for CGI, whether a rewrite or a C extension. But I think this is
not a very large problem.

Making require 'bio' automatically load cgi.rb is undesirable.
If the html code is not automatically loaded by require 'bio'
but only by a separate call to require 'bio/html', then
I feel 'bio/html' loading cgi.rb is within a reasonable range.
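
A minimal sketch of class-specific conversion with a default fallback, assuming it lives in a file that is required separately so that require 'bio' never pulls in cgi.rb. HtmlConversion and ColoredResidue are hypothetical names, not existing BioRuby classes.

```ruby
require 'cgi'

# Hypothetical helper; the module name is illustrative only.
module HtmlConversion
  # Use the object's own to_html when it defines one; otherwise fall
  # back to the HTML-escaped string representation.
  def self.to_html(object)
    if object.respond_to?(:to_html)
      object.to_html
    else
      CGI.escapeHTML(object.to_s)
    end
  end
end

# Example of an HTML-aware class (also hypothetical).
class ColoredResidue
  def initialize(residue, color)
    @residue = residue
    @color = color
  end

  def to_html
    %(<span style="color:#{@color}">#{CGI.escapeHTML(@residue)}</span>)
  end
end
```

With this shape, classes can gain a to_html method one at a time, and everything else still renders through the escaped to_s default.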

The capability to use styles instead of directly specifying color and
font is desirable, since it could reduce the output size and
possibly improve readability.
Nonetheless, this is not mandatory, and a first implementation
with direct specification is ok.

-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2010/01/22, at 2:54, Toshiaki Katayama wrote:

> Dear Pj,
>
> I can understand your frustration and I like your idea of the
> 'module' system, as it reminds me the way how the Linux kernel
> tree is successfully maintained.
>
>> I wonder how many people on this list would contribute code if it was
>> more loosely organised.
>
> Indeed.
>
> However, I think our move from cvs to git was already a great step,
> in that it opened a large opportunity to all those who want to
> participate in development. Before that, an "open source" project did
> not always mean an "open to join" project.
>
> Now, everyone can easily fork the project and release their modified
> code, as you have already done. So, we may be able to evaluate from the
> current situation how many other people have tried.
>
> Anyway, it is still a difficult problem who will decide, and how,
> when to migrate contributed code into the main tree.
> It might sound like an excuse, but I'm also suffering from this
> difficulty. I also have several modules which are not yet contributed
> to the main tree.
> For example, my SGE library for BioRuby (http://kanehisa.hgc.jp/~k/sge/),
> because I'm not sure it is general enough and where it fits.
>
>
> As for the HTML portion, I see your point.
>
> * I'd like to hear comments from others.
> * How people like to render/visualize the BioRuby objects
>   (especially in HTML)?
> * I didn't mean to use the CGI class for HTML generation (I even
>   don't like that).
> * The use of <p /> seems invalid in XHTML. See
>   http://www.w3.org/TR/xhtml1/#C_3
>
>
> P.S.
> Once, I developed a mechanism to integrate end-user code snippets
> in the BioRuby shell, called plugins. I wrote some plugins which render
> a colored codon table, a formatted summary of sequence properties etc.
>
> If those and the functions defined in your plugins can be easily
> accessed by
>
>   puts Bio.your_function_name(options)
>
> or something like that, would that satisfy your needs?
>
> If so, we can consider making a repository for such plugins and
> bundling them in BioRuby as well.
>
> Regards,
> Toshiaki Katayama
>
>
> On 2010/01/22, at 1:20, Pjotr Prins wrote:
>
>> Dear Toshiaki,
>>
>> On Thu, Jan 21, 2010 at 11:05:42PM +0900, Toshiaki Katayama wrote:
>>> I looked your code and had a feeling that we should use some
>>> template system.  If HTML tags are hard coded in the library as you
>>> did, it will be very hard to modify them by the user.
>>
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>


From jan.aerts at gmail.com  Fri Jan 22 09:34:43 2010
From: jan.aerts at gmail.com (Jan Aerts)
Date: Fri, 22 Jan 2010 09:34:43 +0000
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>
References: <mailman.17.1264093205.11305.bioruby@lists.open-bio.org>
	<EC30B4CB-86D0-4FC0-97D7-FDF86F697AF8@unil.ch>
Message-ID: <4c7507a71001220134j3eecf626y90755ddd919336e4@mail.gmail.com>

Hear, hear... Exactly my feelings as well.

j.



From tomoakin at kenroku.kanazawa-u.ac.jp  Fri Jan 22 09:48:20 2010
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 22 Jan 2010 18:48:20 +0900
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <20100122085508.GB12248@thebird.nl>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
	<20100122085508.GB12248@thebird.nl>
Message-ID: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>

Hi,

>   Bio::Module.load(:html_alignment)

What is the benefit over
require 'bio/html_alignment' # no autoload by require 'bio'
?

>   Bio::Module.install(:html_alignment)
>
> will search the definition and install it.


I feel installation is easier from shell like:
$ ruby bioruby-inst-module html_alignment
but calling the Module.install internally is fine.

> (5) Modules can be updated outside Bioruby's mainline. e.g.  
> Bio::Module.install(:html_alignment,:development=>true)

We need to have a mechanism to check the versions between
the standard bioruby and the modules, especially when the
mainline bioruby is updated.  Different modules will perhaps
have different levels of dependency on the bioruby code, and
an update in the main bioruby code may sometimes break an old
module.
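
One possible shape for such a check, sketched with RubyGems' own version classes. The helper name and the requirement strings are assumptions for illustration; no module currently declares such a requirement.

```ruby
require 'rubygems'

# Hypothetical check: does the running BioRuby version satisfy the
# requirement string a module declares?
def compatible_with_bioruby?(requirement, bioruby_version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(bioruby_version))
end
```

A module could then state something like '~> 1.4' and be refused, or only warned about, when the installed mainline falls outside that range.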

-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From pjotr.public14 at thebird.nl  Fri Jan 22 10:49:00 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 22 Jan 2010 11:49:00 +0100
Subject: [BioRuby] Proposal: Bioruby modules (the bazaar)
In-Reply-To: <1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>
References: <20100121162049.GB31462@thebird.nl>
	<DF2DA732-ABF3-492E-893A-FB5E06AA5289@hgc.jp>
	<20100122085508.GB12248@thebird.nl>
	<1B20D151-8246-47B1-8D3B-F319D66FF92F@kenroku.kanazawa-u.ac.jp>
Message-ID: <20100122104900.GB15628@thebird.nl>

On Fri, Jan 22, 2010 at 06:48:20PM +0900, Tomoaki NISHIYAMA wrote:
>>   Bio::Module.load(:html_alignment)
>
> What is the benefit over
> require 'bio/html_alignment' # no autoload by require 'bio'
> ?

A method allows more checking. I presume the module information will
be somewhere in a YAML file in the main tree. Or maintained through
git submodules.
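
A rough sketch of the YAML variant, assuming one entry per module. The keys, the gem name and the URL are invented for the example; only the idea of keeping module information in a YAML file comes from the text above.

```ruby
require 'yaml'

# Hypothetical registry contents; in practice this would be a file in
# the main tree.
REGISTRY_TEXT = <<-EOS
html_alignment:
  source: git
  url: git://example.org/bioruby-htmlalignment.git
paml:
  source: gem
  gem_name: bio-paml
EOS

# Look up a module's definition by name; raise if it is not listed.
def module_definition(name, yaml_text = REGISTRY_TEXT)
  registry = YAML.load(yaml_text)
  registry.fetch(name.to_s) do
    raise ArgumentError, "Bio::Module #{name.inspect} is not listed in the registry"
  end
end
```

This is the kind of checking a method call allows and a bare require does not: the name is validated against the registry before anything is loaded or installed.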

>>   Bio::Module.install(:html_alignment)
>>
>> will search the definition and install it.
>
> I feel installation is easier from shell like:
> $ ruby bioruby-inst-module html_alignment
> but calling the Module.install internally is fine.

My example is for an interactive session. You only do it once (I
hope). Or when an author says he has updated his module.

>> (5) Modules can be updated outside Bioruby's mainline. e.g.  
>> Bio::Module.install(:html_alignment,:development=>true)
>
> We need to have a mechanism to check the versions between
> the standard bioruby and the modules. Especially when the
> mainline bioruby is updated.  Different modules perhaps will
> have different level of dependency on the bioruby code, and
> update in the main bioruby code sometimes may break the old
> module.

Well.

Bioruby should not care.

I think you misunderstand the purpose. Modules are *not* to be
supported from Bioruby. It is only a mechanism to make them easily
available. If things break, they break. That is why it is
developmental, or experimental.

The modules that are well 'supported' will come inside the
distribution.  Outside modules are up to the module maintainer.

Besides, you don't want to replace gems. If an author wants versioning
he can provide a gem (which, again, can be loaded as a Bioruby
module).

Once a module goes main stream versioning is moot. It just becomes
part of the Bioruby tree.

When everyone understands this a module can still support versioning.
But I think that ought to be done through gems.

Pj.


From andrew.j.grimm at gmail.com  Tue Jan 26 12:12:35 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Tue, 26 Jan 2010 23:12:35 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>

Hi Naohisa Goto,

I tried creating a new factory in each thread, but I sometimes (but
not always) have errors.

Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
correct? Does it cause problems for anyone else?

Some of the errors I get include the ones seen at http://gist.github.com/286775

It's possible that the issues are caused by problems in tempfile
itself (which may have been fixed in August 2009 according to the
changelog).

Thanks,

Andrew

On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi,
>
> On Wed, 20 Jan 2010 23:09:19 +1100
> Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
>
>> Is alignment intended to be thread-safe in bioruby? If so, should I
>> use the same alignment factory between threads, or a separate one in
>> each thread?
>
> It is not confirmed to be thread-safe, so it is safe to use
> separate one in each thread.
>
> Currently, in BioRuby, manipulating the same object from different
> threads is not intended. When manipulating the same object from
> different threads is needed, using mutex is recommended.
>
> For library developers, it is encouraged to write thread-safe
> code if possible, but not mandatory.
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Andrew
>
>


From ngoto at gen-info.osaka-u.ac.jp  Tue Jan 26 15:00:04 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 27 Jan 2010 00:00:04 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
	<b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
Message-ID: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>

Hi Andrew,

On Tue, 26 Jan 2010 23:12:35 +1100
Andrew Grimm <andrew.j.grimm at gmail.com> wrote:

> Hi Naohisa Goto,
> 
> I tried creating a new factory in each thread, but I sometimes (but
> not always) have errors.

Please show your Ruby version and BioRuby version.
 % ruby -v
 % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
(If you are using BioRuby 1.2.1 or earlier, 
 % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
)

> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
> correct? Does it cause problems for anyone else?

The "rescue RuntimeError" in line 15 may hide problems.
In my environment, the RuntimeError seems to be raised
in lib/bio/alignment.rb. The error message I observed
without the rescue was
"alignment result is inconsistent with input data",
and the output file created by ClustalW was unexpectedly empty.
It might be a bug in Ruby's Tempfile, but I am not sure.
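
As an illustration of keeping such failures visible, here is a minimal sketch that records each thread's exception instead of discarding it. The helper name is hypothetical, and the jobs are plain callables standing in for the per-thread alignment calls.

```ruby
# Run each job in its own thread and collect either its result or the
# exception it raised, so that failures are reported rather than hidden.
def run_collecting_errors(jobs)
  threads = jobs.map do |job|
    Thread.new do
      begin
        [:ok, job.call]
      rescue StandardError => e
        [:error, e]
      end
    end
  end
  # Thread#value joins the thread and returns the block's last value.
  threads.map { |t| t.value }
end
```

Each [:error, exception] pair can then be printed with its message and backtrace after all threads have finished.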

With Ruby 1.8.7, errors are observed some of the time.
  % ruby -v
  ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
  ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
  ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

With Ruby 1.9.1-p378, no errors occurred when I ran it several times.
  % ruby -v
  ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]

> Some of the errors I get include the ones seen at http://gist.github.com/286775

The message "ERROR: Multiple sequences found with same name
(found 0 at least twice)!" is reported by ClustalW, and
it indicates incorrect sequence names in the input file. Maybe
the contents of two files are unexpectedly concatenated or mixed,
possibly due to a bug in Tempfile, but I am not sure.

> It's possible that the issues are caused by problems in tempfile
> itself (which may have been fixed in August 2009 according to the
> changelog).

Another possibility is the resource limits of the machine:
the number of child processes, total memory size, etc.
If a limit is exceeded, a new child clustalw process cannot
be started, or running clustalw processes might be
killed. This also causes empty or truncated result files,
and leads to Ruby-level errors.
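
One way to stay under such limits is to cap the number of threads that run at once. The following is only a sketch with a hypothetical helper name; in a real test the block would create its own factory and run the alignment.

```ruby
require 'thread'

# Process items with at most pool_size worker threads at a time.
# Results are returned in input order.
def each_in_pool(items, pool_size)
  queue = Queue.new
  items.each_with_index { |item, i| queue << [item, i] }
  results = Array.new(items.size)
  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        begin
          # Non-blocking pop raises ThreadError once the queue is empty.
          item, i = queue.pop(true)
        rescue ThreadError
          break
        end
        results[i] = yield(item)
      end
    end
  end
  workers.each { |w| w.join }
  results
end
```

If the errors disappear with a small pool_size, resource limits become the more likely explanation; if they persist, Tempfile remains a suspect.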

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



From andrew.j.grimm at gmail.com  Wed Jan 27 03:07:18 2010
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Wed, 27 Jan 2010 14:07:18 +1100
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To: <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
References: <b9140daa1001200409u24a6cc7dgbc74cf4725b3f4e1@mail.gmail.com>
	<20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
	<b9140daa1001260412p7fc8582dgb87861906854494@mail.gmail.com>
	<20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <b9140daa1001261907v674bd586ha96950baf87ff66e@mail.gmail.com>

Hi Naohisa Goto,

On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO
<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi Andrew,
>
> On Tue, 26 Jan 2010 23:12:35 +1100
> Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
>
>> Hi Naohisa Goto,
>>
>> I tried creating a new factory in each thread, but I sometimes (but
>> not always) have errors.
>
> Please show ruby version and BioRuby version.
>  % ruby -v
>  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
> (If you are using BioRuby 1.2.1 or earlier,
>  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
> )
>

I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0.

>> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
>> correct? Does it cause problems for anyone else?
>
> The "rescue RuntimeError" in line 15 may hide problems.
> In my environment, it seems that the RuntimeError is raised
> in lib/bio/alignment.rb. The error message I observed
> without the rescue was
> "alignment result is inconsistent with input data",
> and output file created by Clustalw was unexpectedly empty.
> It might be a bug of Tempfile in Ruby, but not sure.
>
> With Ruby 1.8.7, errors are observed in some times.
>   % ruby -v
>   ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
>   ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
>   ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
>
> With Ruby 1.9.1-p378, no errors when I executed several times.
>   % ruby -v
>   ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
>

I suspect errors may occur on earlier versions of ruby 1.9.1.

>> Some of the errors I get include the ones seen at http://gist.github.com/286775
>
> The message "ERROR: Multiple sequences found with same name
> (found 0 at least twice)!" is reported by ClustalW, and
> it indicates incorrect input file sequence names. Maybe
> two file contents are unexpectedly concatenated or mixed
> possibly due to a bug of Tempfile, but not sure.
>
>> It's possible that the issues are caused by problems in tempfile
>> itself (which may have been fixed in August 2009 according to the
>> changelog).
>
> Another possibility is resource limits of the machine:
> the number of child processes, total memory size, etc.
> If exceeding limits, new child clustalw process could
> not be started, or running clustalw processes might be
> killed. This also causes void or truncated result files,
> and leads to ruby-level errors.
>

Thanks for that suggestion. I re-ran the test using only 5 threads in
the new gist http://gist.github.com/287499

> Thanks,
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>>
>> Thanks,
>>
>> Andrew
>>
>> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
>> <ngoto at gen-info.osaka-u.ac.jp> wrote:
>> > Hi,
>> >
>> > On Wed, 20 Jan 2010 23:09:19 +1100
>> > Andrew Grimm <andrew.j.grimm at gmail.com> wrote:
>> >
>> >> Is alignment intended to be thread-safe in bioruby? If so, should I
>> >> use the same alignment factory between threads, or a separate one in
>> >> each thread?
>> >
>> > It is not confirmed to be thread-safe, so it is safer to use
>> > a separate one in each thread.
>> >
>> > Currently, BioRuby is not designed for manipulating the same
>> > object from different threads. When that is needed,
>> > using a mutex is recommended.
>> >
>> > Library developers are encouraged to write thread-safe
>> > code if possible, but it is not mandatory.
>> >
>> > Naohisa Goto
>> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>> >
>> >>
>> >> Andrew
>> >> _______________________________________________
>> >> BioRuby Project - http://www.bioruby.org/
>> >> BioRuby mailing list
>> >> BioRuby at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioruby
>> >
>> >
>
>


From missy at be.to  Fri Jan 29 06:46:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 15:46:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
Message-ID: <4B628437.30305@be.to>

Hi all,

How about implementing the following methods?

	Bio::FastaFormat#each_entry
	Bio::FastaNumericFormat#each_entry

The following is sample code that generates a FASTQ string from a FASTA
string and a FASTA.QUAL string. This sample may need Ruby 1.8.7 or later.

I am afraid that simpler or easier ways may already exist in BioRuby...

Hiro.

-----
#!/usr/local/bin/ruby
require 'rubygems'
require 'bio'

module Bio
   class FastaFormat
     def each_entry
       return to_enum(:each_entry) unless block_given?
       @continue = self.dup
       loop do
         yield @continue
         overrun = @continue.entry_overrun
         break unless overrun
         @continue = Bio::FastaFormat.new(overrun)
       end
     end
   end

   class FastaNumericFormat
     def each_entry
       return to_enum(:each_entry) unless block_given?
       @continue = self.dup
       loop do
         yield @continue
         overrun = @continue.entry_overrun
         break unless overrun
         @continue = Bio::FastaNumericFormat.new(overrun)
       end
     end
   end
end

fasta = <<EOS
>FXQB1I00000001
TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
>FXQB1I00000002
CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
EOS

qual = <<EOS
>FXQB1I00000001
30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30 
42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
>FXQB1I00000002
30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34 
22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30 
27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
EOS

enum_fasta = Bio::FastaFormat.new(fasta).each_entry
enum_qual = Bio::FastaNumericFormat.new(qual).each_entry

loop do
   fastq = Bio::Sequence.adapter(enum_fasta.next,
                                 Bio::Sequence::Adapter::Fastq)
   fastq.quality_score_type = :phred
   fastq.quality_scores = enum_qual.next.data
   puts fastq.output(:fastq)
end

-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences


From ngoto at gen-info.osaka-u.ac.jp  Fri Jan 29 10:25:29 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 29 Jan 2010 19:25:29 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <4B628437.30305@be.to>
References: <4B628437.30305@be.to>
Message-ID: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" <missy at be.to> wrote:

> Hi all,
> 
> How about implementing the following methods?
> 
> 	Bio::FastaFormat#each_entry
> 	Bio::FastaNumericFormat#each_entry
> 
> The following is sample code that generates a FASTQ string from a FASTA
> string and a FASTA.QUAL string. This sample may need Ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways may already exist in BioRuby...

I think mixing a single-entry parser with a multiple-entry iterator
would cause confusion, and is not a good approach.

For most parser classes in BioRuby, the expected data source is a
String containing a single entry. For an IO with possibly multiple
entries, Bio::FlatFile is the front end: it detects the data type,
splits the input into entries, and calls the assigned parser class
on each entry.

For a String containing multiple entries, wrapping it in StringIO and
then passing it to Bio::FlatFile is the easiest way, although indirect.
Recently, many efficient memory-mapped data transfer methods
have become available, e.g. memcached, IPC shared memory, and the mmap(2)
system call. I'm now thinking about how to handle such data efficiently.

Below is an example using StringIO and Bio::FlatFile.
#------------------------------------------------
  require 'stringio'
  require 'bio'

  # When copying and pasting this script, remove the "> " at the
  # head of each line.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
> 
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30 
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34 
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30 
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS
  
  ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
  ff_qual = Bio::FlatFile.open(StringIO.new(qual))

  while entry_fasta = ff_fasta.next_entry
    seq = entry_fasta.to_biosequence
    seq.quality_score_type = :phred
    seq.quality_scores = ff_qual.next_entry.data
    puts seq.output(:fastq, :title => entry_fasta.definition)
  end
#------------------------------------------------

> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
> 
> loop do
>    fastq = Bio::Sequence.adapter(enum_fasta.next,
>                                  Bio::Sequence::Adapter::Fastq)
>    fastq.quality_score_type = :phred
>    fastq.quality_scores = enum_qual.next.data
>    puts fastq.output(:fastq)
> end

Bio::Sequence.adapter is for internal use within the BioRuby library
only, and normally should not be called from user scripts. In addition,
Adapter::Fastq does not match Bio::FastaFormat data.
In this case, use Bio::FastaFormat#to_biosequence.

> 
> -- 
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From missy at be.to  Fri Jan 29 11:24:15 2010
From: missy at be.to (MISHIMA, Hiroyuki)
Date: Fri, 29 Jan 2010 20:24:15 +0900
Subject: [BioRuby] Proposal: Bio::FastaFormat#each_entry
In-Reply-To: <20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
References: <4B628437.30305@be.to>
	<20100129102530.3BBEA1CBC436@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <4B62C55F.1050506@be.to>

Hi, Naohisa GOTO,

Thank you so much for the detailed explanation and the sample code. It was
a big help for me in understanding BioRuby's overall design.

Although I used here-documents in my code, what I wanted to do was just
to make a FASTQ file from regular FASTA and FASTA.QUAL files.

I tried your code using my relatively large input files. It was much
faster than my code.

The final code is simply the following:
----
require 'bio'

ff_fasta = Bio::FlatFile.open(ARGV[0])
ff_qual = Bio::FlatFile.open(ARGV[0]+".qual")

while entry_fasta = ff_fasta.next_entry
   seq = entry_fasta.to_biosequence
   seq.quality_score_type = :phred
   seq.quality_scores = ff_qual.next_entry.data
   puts seq.output(:fastq, :title => entry_fasta.definition)
end
----
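
For readers without BioRuby at hand, the pairing of each sequence with
its quality scores can be sketched in plain Ruby as below. This is an
illustration only, not what seq.output(:fastq) does internally: it
assumes the Sanger-style encoding (Phred score + 33 as an ASCII
character), and phred_to_sanger, fastq_record and the title "read1" are
made-up names. Whether BioRuby's :fastq output uses exactly this variant
by default should be checked against its documentation.

```ruby
# Encodes Phred scores as Sanger-style FASTQ quality characters (score + 33).
def phred_to_sanger(scores)
  scores.map { |q| (q + 33).chr }.join
end

# Builds one four-line FASTQ record from plain Ruby values.
# Raises if the number of bases and the number of scores differ, which
# would indicate that the FASTA and QUAL entries are out of step.
def fastq_record(title, sequence, scores)
  unless sequence.length == scores.length
    raise ArgumentError,
          "#{title}: #{sequence.length} bases but #{scores.length} scores"
  end
  "@#{title}\n#{sequence}\n+\n#{phred_to_sanger(scores)}\n"
end

record = fastq_record("read1", "ACGT", [30, 30, 29, 5])
puts record
```

The length check is a cheap guard for the case where the two input files
do not list the same entries in the same order.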

Hiro.

Naohisa GOTO wrote (2010/01/29 19:25):
> Hi,
>
> On Fri, 29 Jan 2010 15:46:15 +0900
> "MISHIMA, Hiroyuki"<missy at be.to>  wrote:
>
>> Hi all,
>>
>> How about implementing the following methods?
>>
>> 	Bio::FastaFormat#each_entry
>> 	Bio::FastaNumericFormat#each_entry
>>
>> The following is a sample code to generate a FASTQ string from a FASTA
>> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>>
>> I am afraid that simpler or easier ways are already existed in BioRuby...
>
> I think mixing single entry parser with multiple entry iterator
> will cause confusion, and not good way.
>
> For most parser classes in bioruby, expected data source is
> String containing single entry data. In addition, for IO with
> possible multiple entries, Bio::FlatFile is the front-end that
> can detect data type, splits each entry, and calling assigned
> parser class.
>
> For String containing multiple entries, using StringIO and
> then Bio::FlatFile is the easiest way, although indirect.
> Recently, many efficient memory-mapped data transfer methods
> are available, e.g. memcached, IPC shared memory, mmap(2)
> system call. I'm now thinking how to treat such data efficiently.

-- 
MISHIMA, Hiroyuki, DDS, Ph.D.
COE Research Fellow
Department of Human Genetics
Nagasaki University Graduate School of Biomedical Sciences


From biopython at maubp.freeserve.co.uk  Fri Jan 29 10:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [BioRuby] [Bioperl-l] [MOBY-dev] OpenBio solution challenge:
	Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results.  This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on...  Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious, for example? There are already several major projects
to do this, e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)