From ngoto at gen-info.osaka-u.ac.jp Mon Sep 1 06:44:06 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 1 Sep 2008 19:44:06 +0900
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To:
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <20080831042546.D246F1CBC56E@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080901104407.6D8821CBC40F@idnmail.gen-info.osaka-u.ac.jp>

Hi Ben,

The failures 4) to 7) may be caused by the conflicts of test class names.
I changed test class names to fix this.
(commits 536cdf903a3c3908c117efd554d33117d91452f4 and
0fe1e7d3ed02185632f4a34d8efe1f21f755b289).

Current HEAD is:
http://github.com/bioruby/bioruby/commit/0fe1e7d3ed02185632f4a34d8efe1f21f755b289

Note that the first three failures are still unfixed.

Could you please try again?

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Sun, 31 Aug 2008 17:11:31 +1000
"Ben Woodcroft" wrote:

> Hi,
>
> Thanks for your concern.
> After pulling from the newest github -
> http://github.com/bioruby/bioruby/commit/e86f8d757c45805389e154f06ccde5a3d9e8a557
>
> $ ruby -v
> ruby 1.8.6 (2007-09-24 patchlevel 111) [i486-linux]
> $ uname -a
> Linux uyen 2.6.24-21-generic #1 SMP Mon Aug 25 17:32:09 UTC 2008 i686 GNU/Linux
>
> Using Ubuntu Hardy, and the latest patched version of the ruby1.8
> package (1.8.6.111-2ubuntu1.1)
>
> $ ruby runner.rb
> Loaded suite .
> Started > .....FF..F.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................. .! > ..............................................................................................................................................................................................................................................................................................F............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................FFE................................................ > Finished in 142.816902 seconds. 
>
> 1) Failure:
> test_gff_exportview(Bio::FuncTestEnsemblHuman)
> [./functional/bio/io/test_ensembl.rb:95]:
> <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158;
> transcript_id=ENST00000382964; exon_id=ENSE00001494097;
> gene_type=KNOWN_protein_coding\n"> expected but was
> <"">.
>
> 2) Failure:
> test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman)
> [./functional/bio/io/test_ensembl.rb:121]:
> <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158;
> transcript_id=ENST00000382964; exon_id=ENSE00001494097;
> gene_type=KNOWN_protein_coding\n"> expected but was
> <"">.
>
> 3) Failure:
> test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman)
> [./functional/bio/io/test_ensembl.rb:180]:
> <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097\tKNOWN_protein_coding\n">
> expected but was
> <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n">.
>
> 4) Failure:
> test_id_line_sequence_version(Bio::TestEMBL)
> [./unit/bio/db/embl/test_embl_rel89.rb:45]:
> <"1"> expected but was
> .
>
> 5) Failure:
> test_left_padding(Bio::TestStringFormatting)
> [./unit/bio/util/restriction_enzyme/test_string_formatting.rb:43]:
> <"nnnnnnn"> expected but was
> <"">.
>
> 6) Failure:
> test_right_padding(Bio::TestStringFormatting)
> [./unit/bio/util/restriction_enzyme/test_string_formatting.rb:50]:
> <"nn"> expected but was
> <"">.
>
> 7) Error:
> test_strip_padding(Bio::TestStringFormatting):
> NoMethodError: undefined method `[]' for nil:NilClass
> ../lib/bio/util/restriction_enzyme/string_formatting.rb:64:in
> `strip_padding'
> ./unit/bio/util/restriction_enzyme/test_string_formatting.rb:33:in
> `test_strip_padding'
>
> 1867 tests, 4049 assertions, 6 failures, 1 errors
>
> I don't actually care personally about these problems, but am glad to
> help out in a general sense.
>
> Thanks,
> ben

From davide.rambaldi at ifom-ieo-campus.it Mon Sep 1 07:44:36 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Mon, 1 Sep 2008 13:44:36 +0200
Subject: [BioRuby] test and bioruby shell questions
Message-ID:

On Aug 31, 2008, at 6:05 AM, Naohisa GOTO wrote:

> Next time, please show all failure message, even if long.

Dear Naohisa,

Attached you will find the complete report of the BioRuby tests
(I forgot to include it in the first mail ... :P ).

My platform: bioruby 1.2.1 and ruby 1.8.7 on a PowerPC G4, OS X 10.4.11

TEST OUTPUT:
-------------- next part --------------

Best Regards

Davide Rambaldi, Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From davide.rambaldi at ifom-ieo-campus.it Mon Sep 1 08:18:03 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Mon, 1 Sep 2008 14:18:03 +0200
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To: <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
Message-ID:

On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote:

> bioruby> rm :a
>
> Actually, the rm command temporally assign 'nil' to the variable
> because BioRuby shell will avoid to dump variables having 'nil' as
> its value.
> (This means, the memory will not be returned to the OS until next GC.)
>
> This implementation looks somewhat ugly, so if you have a better
> idea, please let me know.

Dear Toshiaki,

I have implemented another method (rm2) that extends your rm command
using a case statement and === (case equality).

In practice, if it finds a String or a Symbol it just calls rm(name),
while if it finds an Array, it iterates over the Array and calls rm(e)
on each element:

def rm2(name)
  # dispatch on the class of the argument;
  # "then" is used instead of ":" because the colon form is Ruby 1.8 only
  case name
  when String, Symbol then rm(name)
  when Array then name.each { |e| rm(e) }
  end
end

This allows commands like:

rm2(list=ls()) <-- R console style! :P

This puts the list of current objects into an Array named list, then
removes them all! Obviously it is inspired by the R console :P (which
has the same command).

I have put it directly into the bin/bioruby.rb file to test and it
seems to work... tell me if it is useful, and don't hesitate to add it
to the current code if you think it is a good idea.

cheers and best regards!

Davide Rambaldi, Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From donttrustben at gmail.com Mon Sep 1 09:01:47 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Mon, 1 Sep 2008 23:01:47 +1000
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To: <20080901104407.6D8821CBC40F@idnmail.gen-info.osaka-u.ac.jp>
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <20080831042546.D246F1CBC56E@idnmail.gen-info.osaka-u.ac.jp> <20080901104407.6D8821CBC40F@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi,

Your commits seem to fix things, as I only get the first 3 errors.

Thanks again,
ben

$ ruby runner.rb
Loaded suite .
Started .....FF..F........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ Finished in 141.782259 seconds. 

1) Failure:
test_gff_exportview(Bio::FuncTestEnsemblHuman)
[./functional/bio/io/test_ensembl.rb:95]:
<"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158;
transcript_id=ENST00000382964; exon_id=ENSE00001494097;
gene_type=KNOWN_protein_coding\n"> expected but was
<"">.

2) Failure:
test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman)
[./functional/bio/io/test_ensembl.rb:121]:
<"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158;
transcript_id=ENST00000382964; exon_id=ENSE00001494097;
gene_type=KNOWN_protein_coding\n"> expected but was
<"">.

3) Failure:
test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman)
[./functional/bio/io/test_ensembl.rb:180]:
<"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097\tKNOWN_protein_coding\n">
expected but was
<"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n">.

1906 tests, 4111 assertions, 3 failures, 0 errors

2008/9/1 Naohisa GOTO

> Hi Ben,
>
> The failures 4) to 7) may be caused by the conflicts of test class names.
> I changed test class names to fix this.
> (commits 536cdf903a3c3908c117efd554d33117d91452f4 and
> 0fe1e7d3ed02185632f4a34d8efe1f21f755b289).
>
> Current HEAD is:
>
> http://github.com/bioruby/bioruby/commit/0fe1e7d3ed02185632f4a34d8efe1f21f755b289
>
> Note that the first three failures are still unfixed.
>
> Could you please try again?
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From pjotr2008 at thebird.nl Tue Sep 2 02:50:56 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Tue, 2 Sep 2008 08:50:56 +0200
Subject: [BioRuby] BioRuby standards
Message-ID: <20080902065055.GA29634@thebird.nl>

Hi everyone,

I have been doing some work on microarray support for BioRuby, see

http://github.com/pjotrp/bioruby/tree/bioruby-testing-pjotr

There are two questions I want to raise about standards, as I see
different solutions in the current tree. The first is about error
handling, the second about caching.

1) Error handling ought to print to stderr, and we need a consistent
way of handling errors, as well as a more fine-grained approach towards
warning, info, debug etc. messages. Can we come up with a standard
where a user can set these from outside BioRuby, e.g. through an
environment setting? And what classes can we use for consistent
messaging? Obviously a standard way for exceptions is part of that.

2) Web based tools often like to cache things on the local file
system. I suggest using BIORUBY_CACHE as a standard environment
variable. And, perhaps, BIORUBY_CACHE_SIZE, though that would require
a module to monitor that.

For (1) David Powers came up with a nice approach for the Cfruby
project - where modules can override behaviour of the error handling
(I wanted that for the Cfenjin application). See

http://rubyforge.org/projects/cfruby/

and the source code at:

http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb

with my usage:

http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb
http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb

In my case I wanted to override the standard single switch for WARN,
INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels
and TESTING. For BioRuby it is simpler, as we (perhaps) have no
such requirement at the library level.

Pj.
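
For point (1), a minimal sketch of what a shared, level-controlled
message channel might look like, built on Ruby's standard Logger class.
The module name BioLogSketch and the environment variable
BIORUBY_LOG_LEVEL are hypothetical and are not existing BioRuby names;
this is one possible shape under those assumptions, not a proposal that
was made in the thread.

```ruby
require 'logger'
require 'stringio'

# Hypothetical module (not part of BioRuby): one shared logger for the library.
module BioLogSketch
  LEVELS = {
    'DEBUG' => Logger::DEBUG,
    'INFO'  => Logger::INFO,
    'WARN'  => Logger::WARN,
    'ERROR' => Logger::ERROR
  }.freeze

  class << self
    # Applications can swap in their own logger (file, web server log, ...).
    attr_writer :logger

    # By default, messages go to $stderr.
    def logger
      @logger ||= build($stderr)
    end

    # BIORUBY_LOG_LEVEL is an assumed variable name; unknown values fall
    # back to WARN.
    def build(io, level_name = ENV.fetch('BIORUBY_LOG_LEVEL', 'WARN'))
      log = Logger.new(io)
      log.level = LEVELS.fetch(level_name.to_s.upcase, Logger::WARN)
      log
    end
  end
end

# Usage: capture messages in a buffer instead of $stderr.
buffer = StringIO.new
BioLogSketch.logger = BioLogSketch.build(buffer, 'info')
BioLogSketch.logger.debug('hidden at INFO level')
BioLogSketch.logger.warn('could not parse header line')
```

With a single entry point like this, library code would call
BioLogSketch.logger.warn(...) instead of printing directly, and a
calling program decides where the messages end up.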

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 2 04:47:11 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 2 Sep 2008 17:47:11 +0900
Subject: [BioRuby] BioRuby standards
In-Reply-To: <20080902065055.GA29634@thebird.nl>
References: <20080902065055.GA29634@thebird.nl>
Message-ID: <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Tue, 2 Sep 2008 08:50:56 +0200
pjotr2008 at thebird.nl (Pjotr Prins) wrote:

> Hi everyone,
>
> I have been doing some work on microarray support for BioRuby, see
>
> http://github.com/pjotrp/bioruby/tree/bioruby-testing-pjotr
>
> There are two questions I want to raise about standards, as I see
> different solutions in the current tree. First is about error
> handling. Second about caching.
>
> 1) Error handling ought to print to stderr, and we need a consistent
> way of handling them, as well as a more fine grained approach towards
> warnings, info, debug etc. messages. Can we come up with a standard
> where a user can set these from outside Bioruby, e.g. through an
> environment setting. And what classes can we use for consistent
> messaging. Obviously a standard way for exceptions is part of that.

As you said, no standards, but, empirically in BioRuby,

* Small errors are simply ignored and the program continues.
* When normal (but not severe) errors, prints warning messages
  to $stdout, and continues to process.
* When severe error, raises error.

>
> 2) Web based tools often like to cache things on the local file
> system. I suggest using BIORUBY_CACHE as a standard environment
> variable. And, perhaps, BIORUBY_CACHE_SIZE, though that would require
> a module to monitor that.

Because BioRuby is a library (except for BioRuby Shell),
it is generally not so good to depend on environment variables.
Instead, to prepare APIs to set cache positions and sizes
is better.
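
To illustrate the API-based approach, here is a minimal sketch in which
each class carries its own cache location and size limit, set through
method calls rather than environment variables. All names in it
(CacheSettingsSketch, configure_cache, the two example classes, the
paths and the sizes) are assumptions made up for illustration and are
not part of BioRuby.

```ruby
require 'tmpdir'

# Hypothetical mixin (not part of BioRuby): each class that extends it
# gets its own cache location and size limit.
module CacheSettingsSketch
  Settings = Struct.new(:dir, :max_bytes)

  # Per-class settings; the defaults are arbitrary values for this sketch.
  def cache_settings
    @cache_settings ||= Settings.new(Dir.tmpdir, 10 * 1024 * 1024)
  end

  # Override only the values that are given.
  def configure_cache(dir: nil, max_bytes: nil)
    cache_settings.dir = dir if dir
    cache_settings.max_bytes = max_bytes if max_bytes
    cache_settings
  end
end

# Two made-up classes with independent cache settings.
class FastDiskClientSketch
  extend CacheSettingsSketch
end

class SmallCacheClientSketch
  extend CacheSettingsSketch
end

# A large cache on a fast disk for one class, a small default-located
# cache for the other.
FastDiskClientSketch.configure_cache(dir: '/fast/disk/cache', max_bytes: 50 * 1024**3)
SmallCacheClientSketch.configure_cache(max_bytes: 64 * 1024**2)
```

Because the settings live on each class, changing one class's cache
does not affect any other class.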

Note that some classes use Tempfile class, a standard bundled
class with Ruby by default, and the Tempfile class depends
on environment variables (TMPDIR, TMP, etc.).

I think cache isn't suitable for standard, because its purpose
may differ from program (or class, module, etc.) to program.
For example, if I want to put class A's cache on a fast hard disk
with very large size, and program B's cache on a slower hard disk
with small size, what should I do?

> For (1) David Powers came up with a nice approach for the Cfruby
> project - where modules can override behaviour of the error handling
> (I wanted that for the Cfenjin application). See
>
> http://rubyforge.org/projects/cfruby/
>
> and the source code at:
>
> http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb
>
> with my usage:
>
> http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb
> http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb
>
> In my case I wanted to override the standard single switch for WARN,
> INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels
> and TESTING. For BioRuby it is simpler, as we have (perhaps) have no
> such requirement at the library level.

I've not seen this yet, but is it different from the Logger class,
a standard bundled class with Ruby?
http://www.ruby-doc.org/stdlib/libdoc/logger/rdoc/classes/Logger.html

Thanks,

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr2008 at thebird.nl Tue Sep 2 05:19:58 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Tue, 2 Sep 2008 11:19:58 +0200
Subject: [BioRuby] BioRuby standards
In-Reply-To: <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp>
References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080902091958.GA31400@thebird.nl>

Hi Naohisa,

Thanks for your reply. Some comments.

On Tue, Sep 02, 2008 at 05:47:11PM +0900, Naohisa GOTO wrote:
> As you said, no standards, but, empirically in BioRuby,
>
> * Small errors are simply ignored and the program continues.
> * When normal (but not severe) errors, prints warning messages
>   to $stdout, and continues to process.
> * When severe error, raises error.

This is fine for an interactive program - like the shell. But it is
not such a good strategy for software calling into BioRuby (think of a
web server). I am unhappy with this state of things. Can we come up
with something better? I think in the long term this will help
predictability of BioRuby.

> Because BioRuby is a library (except for BioRuby Shell),
> it is generally not so good to depend on environment variables.

Fair enough.

> Instead, to prepare APIs to set cache positions and sizes
> is better.

That would be cool. That API could take care of environment options
too, if we were ever to introduce them.

> Note that some classes use Tempfile class, a standard bundled
> class with Ruby by default, and the Tempfile class depends
> on environment variables (TMPDIR, TMP, etc.).

I noticed. Caching is a bit different in nature - as caches may be
there for a long time. TMPDIRs get emptied on reboot, for one.

> I think cache isn't suitable for standard, because its purpose
> may differ from program (or class, module, etc.) to program.
> For example, if I want to put class A's cache on a fast hard disk
> with very large size, and program B's cache on a slower hard disk
> with small size, what should I do?

That is true. OK, leave caching for the modules to resolve. I'll use
my own caching of GEO XML objects.

> > For (1) David Powers came up with a nice approach for the Cfruby
> > project - where modules can override behaviour of the error handling
> > (I wanted that for the Cfenjin application). See
> >
> > http://rubyforge.org/projects/cfruby/
> >
> > and the source code at:
> >
> > http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb
> >
> > with my usage:
> >
> > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb
> > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb
> >
> > In my case I wanted to override the standard single switch for WARN,
> > INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels
> > and TESTING. For BioRuby it is simpler, as we have (perhaps) have no
> > such requirement at the library level.
>
> I've not seen this yet, but is it different from the Logger class,
> a standard bundled class with Ruby?
> http://www.ruby-doc.org/stdlib/libdoc/logger/rdoc/classes/Logger.html

The difference is that David's version makes use of an observer
pattern to allow overriding and enhancing. This allows a program to
change the behaviour of all (internal) library error handling in a
transparent fashion. Ignore it, it is over the top for BioRuby.

Note: using the Logger class consistently would already be a great
improvement.

Pj.

From davide.rambaldi at ifom-ieo-campus.it Tue Sep 2 06:28:59 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Tue, 2 Sep 2008 12:28:59 +0200
Subject: [BioRuby] Bio::Blat
Message-ID:

Hi all,

I am trying to use Ruby and BioRuby to translate a Perl script that I
am using in my lab to parse psl files.

The blatanalyzer script should: sort entries according to identity,
coverage, score, cut psl files in order to keep only alignments with a
given identity, generate report tables (similar to a web blat result
table in the UCSC server), convert psl to gff and gtf, etc...

USAGE:

Usage ./blatanalyzer.rb [options] action file.psl

It can also be used in a pipe (cat file.psl | ./blatanalyzer.rb action).

I am a newbie at Ruby scripting (and I am also still trying to
understand the conventions used in BioRuby), so I am not sure if my
design is decent or completely stupid/crazy.

First of all, I need some extra methods not present in
Bio::Blat::Report (like coverage, sorting_by, grouping, etc...), so my
idea is to make a subclass of Bio::Blat::Report:

module Bio
  class Blat
    class Analyzer < Report
      def coverage
        # implementation here ...
      end
    end
  end
end

Is this a good idea?

On the other hand, I am working on a Bio::Blat::Application that should
initialize options (parsed by an OptParser class), load a stream, pass
the stream to the Bio::Blat::Analyzer object, and choose which method
(action) to apply to the stream.

Is it OK to put this code in the Bio::Blat namespace, or should I put it
in an external Application class?

Currently the structure of my blatanalyzer.rb application is this:

class Color
  # to handle colorized output (use term-ansicolor)
end

class OptParser
  # parse command line options
end

module Bio
  class Blat
    class Analyzer < Report
      # extend the functionality of the Report with sorting, grouping
      # and other methods
    end

    class Application
      # load a stream, check options, select action and execute it
      # printing result on STDOUT
    end
  end
end

# MAIN.APP
# slurp command line options and start application
options = OptParser.parse(ARGV)
Bio::Blat::Application.new(options, ARGF)

Is there something I need to change? Does it make sense?

Thanks for your help, any suggestion is really welcome!

Davide Rambaldi, Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From jan.aerts at gmail.com Tue Sep 2 11:46:43 2008
From: jan.aerts at gmail.com (Jan Aerts)
Date: Tue, 2 Sep 2008 16:46:43 +0100
Subject: [BioRuby] official announcement move of bioruby from CVS to git
Message-ID: <4c7507a70809020846w2fe0cbe4m17545950c0e33b42@mail.gmail.com>

All,

We can finally tell you that bioruby has officially moved from CVS to
git. Development on CVS will be discontinued. Please use the git
repository at http://github.com/bioruby/bioruby from now on.

How do I get bioruby?
=================
Nothing has changed for how to obtain bioruby. At least for the
released versions. You can still do a "gem install bio" to get the
latest release as we will continue to make the gem available through
rubyforge. Alternatively, it will become possible (but not yet) to
install the gem directly from github: "gem sources -a
http://gems.github.com" followed by a "gem install bioruby-bio".

The story is different if you want to get the latest development
version. Instead of doing a 'cvs checkout' or 'cvs export' as you used
to do, you can clone the online git repository with "git clone
git://github.com/bioruby/bioruby.git". The 'cvs update' you used to do
should now be changed to a "git pull".

How do I contribute to bioruby?
========================
Contributing to bioruby should be much easier with git than it was
with CVS. See this blog post
(http://saaientist.blogspot.com/2008/06/bioruby-with-git-how-would-that-work.html)
for guidelines. Basically, you clone the repository locally and send a
patch or a pull request.
Moreover, if you use the 'fork' button on the github
website, your clone will be on the github system as well and your
development can be followed by everyone (see
http://github.com/bioruby/bioruby/network), which is a Good Thing(TM).

For a guideline on how to format your commit messages nicely, see
here: http://www.tpope.net/node/106

Thanks to everyone who cloned the repository and started developing.
Keep up the good work.

jan.
(also for Naohisa Goto)

From pjotr2008 at thebird.nl Wed Sep 3 04:07:22 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Wed, 3 Sep 2008 10:07:22 +0200
Subject: [BioRuby] official announcement move of bioruby from CVS to git
Message-ID: <20080903080722.GB9055@thebird.nl>

> We can finally tell you that bioruby has officially moved from CVS to
> git. Development on CVS will be discontinued. Please use the git
> repository at http://github.com/bioruby/bioruby from now on.

This is great! I must say, the more I use git, the more I like it.
This is the version control system I have always wanted (after darcs
and Mercurial). It is a tad complex when using more advanced features,
but once they work they are stunningly good.

And github is also an astounding tool (much of it Ruby based, I
gather).

Every bioinformatician should make git part of his/her toolbox.
Really.

Pj.

From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 05:45:05 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Wed, 3 Sep 2008 11:45:05 +0200
Subject: [BioRuby] Bio::Blat::Report
Message-ID: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it>

Hi,

after installing the last version from git
(http://github.com/bioruby/bioruby), I have a couple of warnings using
my application:

NOTE: the file test.psl I am using for testing is without psl headers

Oni:~/code/Ruby/bioruby tucano$ ./blatanalyzer list blatanalyzerdir/test/test.psl
/usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:81: warning: private attribute?
/usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:84: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:87: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:90: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:93: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:96: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:270: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/report.rb:89: warning: instance variable @header_lines not initialized The previuos version I was using don't give warnings here the diff of changes in the new git version and in the previous report.rb: diff oldreport.rb /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/ report.rb 48a49,51 > # Splitter for Bio::FlatFile > FLATFILE_SPLITTER = Bio::FlatFile::Splitter::LineOriented > 53c56 < def initialize(text) --- > def initialize(text = '') 57c60 < text.each do |line| --- > text.each_line do |line| 74c77,115 < @columns = parse_header(head) --- > @columns = parse_header(head) unless head.empty? > end > > # Adds a header line if the header data is not yet given and > # the given line is suitable for header. > # Returns self if adding header line is succeeded. > # Otherwise, returns false (the line is not added). > def add_header_line(line) > return false if defined? @columns > line = line.chomp > case line > when /^\d/ > @columns = defined? @header_lines ? parse_header (@header_lines) : [] > return false > when /\A\-+\s*\z/ > @columns = defined? @header_lines ? parse_header (@header_lines) : [] > return self > else > @header_lines ||= [] > @header_lines.push line > end > end > > # Adds a line to the entry if the given line is regarded as > # a part of the current entry. 
> # If the current entry (self) is empty, or the line has the same > # query name, the line is added and returns self. > # Otherwise, returns false (the line is not added). > def add_line(line) > if /\A\s*\z/ =~ line then > return @hits.empty? ? self : false > end > hit = Hit.new(line.chomp) > if @hits.empty? or @hits.first.query.name == hit.query.name then > @hits.push hit > return self > else > return false > end Best Regards Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From mail at michaelbarton.me.uk Wed Sep 3 08:27:23 2008 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 3 Sep 2008 13:27:23 +0100 Subject: [BioRuby] official announcement move of bioruby from CVS to git In-Reply-To: <20080903080722.GB9055@thebird.nl> References: <20080903080722.GB9055@thebird.nl> Message-ID: I completely agree with what Pjotr has written. I think moving to git/github is great step for BioRuby, and I hope to see it pay divends in the future for the development of the code base. This is great work by everyone involved to move BioRuby over to git. Mike On Wed, Sep 3, 2008 at 9:07 AM, Pjotr Prins wrote: > > We can finally tell you that bioruby has officially moved from CVS to > > git. Development on CVS will be discontinued. Please use the git > > repository at http://github.com/bioruby/bioruby from now on. > > This is great! I must say, the more I use git, the more I like it. > This is the version control system I have always wanted (after darcs > and Mercurial). It is a tad complex when using more advanced features, > but once they work they are stunningly good. 
And github is also an > astounding tool (much of it Ruby based, I gather). > > Every bioinformatician should make git part of his/her toolbox. > Really. > > Pj. > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From mail at michaelbarton.me.uk Wed Sep 3 08:32:56 2008 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 3 Sep 2008 13:32:56 +0100 Subject: [BioRuby] Ruby in the minority for bioinformaticians. Message-ID: I recently ran a survey of bioinformaticians which included which programming language do you use. The results will be somewhat biased to people who read blogs etcetera, but does show that Ruby has had a somewhat small uptake in the bioinformatics community. The results can be found here (loads slowly at the moment). http://openwetware.org/wiki/Biogang:Projects/Bioinformatics_Career_Survey_2008_Results Mike From ngoto at gen-info.osaka-u.ac.jp Wed Sep 3 09:34:28 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 3 Sep 2008 22:34:28 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> Message-ID: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Hi, Thanks for reporting bugs. On Wed, 3 Sep 2008 11:45:05 +0200 Davide Rambaldi wrote: > Hi, after installing the last version from git (http://github.com/ > bioruby/bioruby), I have a couple of warnings using my application: > > NOTE: the file test.psl I am using for testing is without psl headers > > Oni:~/code/Ruby/bioruby tucano$ ./blatanalyzer list blatanalyzerdir/ > test/test.psl > /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:81: > warning: private attribute? 
(snip) > /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/report.rb:89: > warning: instance variable @header_lines not initialized The warning message "warning: instance variable @header_lines not initialized" was a bug during header parsing. The messages "warning: private attribute?" are harmless now, but I've changed not to be shown by explicitly specifying private attributes using "private". I've just fixed them in git repository. http://github.com/bioruby/bioruby/commit/3ff940988b76bdff75679cdf0af4c836f76fa3a1 http://github.com/bioruby/bioruby/commit/1440b766202a2b66ac7386b9b46928834a9c9873 Could you please try again with new version? FYI: When reporting, please show which Ruby version, OS, and architecture (type of CPU) you are using, with BioRuby version. In addition, please show a short script and test data to reproduce the bug, or please show all your scripts and data (If very large, put them to your homepage or blog). Note that in this case, I can find problem without these information, and you don't need to do so unless the bugs are not fixed well. > The previuos version I was using don't give warnings > > here the diff of changes in the new git version and in the previous > report.rb: > > diff oldreport.rb /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/ > report.rb Please don't show diffs between already committed versions, except when you can clearly point out what is wrong. Normally, to see diffs with commit messages, doing % git log -p lib/bio/appl/blat/report.rb is enough. 
Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 10:30:42 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Wed, 3 Sep 2008 16:30:42 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <8B0629C9-0DC0-4DBD-BC46-CB2A5D7BF1FE@ifom-ieo-campus.it> > > Could you please try again with new version? > I've just fixed them in git repository. > http://github.com/bioruby/bioruby/commit/ > 3ff940988b76bdff75679cdf0af4c836f76fa3a1 > http://github.com/bioruby/bioruby/commit/ > 1440b766202a2b66ac7386b9b46928834a9c9873 > It's ok now. Thanks I still have 3 errors in testing the last version (just to report it...) 1) Failure: test_gff_exportview(Bio::FuncTestEnsemblHuman) [./test/functional/bio/ io/test_ensembl.rb:95]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1 \tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 2) Failure: test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./ test/functional/bio/io/test_ensembl.rb:121]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1 \tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 3) Failure: test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./ test/functional/bio/io/test_ensembl.rb:180]: <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id \ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366 \t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097 \tKNOWN_protein_coding\n"> expected but was <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id \ttranscript_id\texon_id\tgene_type\n">. 
Thanks again > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Wed Sep 3 10:31:43 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 3 Sep 2008 23:31:43 +0900 Subject: [BioRuby] Bio::Blat In-Reply-To: References: Message-ID: <20080903143144.3EE481CBC4CB@idnmail.gen-info.osaka-u.ac.jp> On Tue, 2 Sep 2008 12:28:59 +0200 Davide Rambaldi wrote: > Hi all, I am trying to use Ruby and BioRuby to translate a Perl > script that I am using in my lab to parse psl files. > > The blatanalyzer script should: > > sort entries according to identity, coverage, score, cut psl files in > order to keep only alignments with a given identity, > generate report tables (similar to a web blat result table in the > UCSC server), convert psl to gff and gtf, etc... > > USAGE: > > Usage ./blatanalyzer.rb [options] action file.psl > > and can be used also in a pipe (cat file.psl | ./blatanalyzer.rb action) > > > I am a newbye of Ruby scripting (and also I am currently trying to > understand the conventions used in BioRuby) so I am not sure if my > design is decent or completely stupid/crazy. > > First of all, I need some extra methods not present in > Bio::Blat::Report (like coverage, sorting_by, grouping, etc...) 
so
> my idea is to made a subclass of Bio::Blat::Report:
>
> module Bio
>   class Blat
>     class Analyzer < Report
>       def coverage
>         implementation here ...
>       end
>     end
>   end
> end
>
> Is this a good idea?

In Ruby, a class that inherits from an existing class can be affected by internal changes to that class, including conflicts of private method names and instance variable names. If you can follow changes in the parent class and update your code accordingly, creating a subclass may be the most efficient way in terms of running speed, memory efficiency, and code size. If you don't want to do so, and/or the internal structure of the parent class isn't clear, it is safer to hold the report as an object, without inheritance. Note that this is only a practical point of view, as I don't know much about the philosophy of OOP.

> On the other side I am working on a Bio::Blat::Application that
> should initialize options (parsed by a OptParser class), load a
> stream, pass the stream to the Bio::Blat::Analyzer object, choose
> which method (action) apply to the stream.
>
> Is OK to put this code in the Bio::Blat namespace? or I should put it
> in an external Application class?

In your application, you can do whatever you like. However, I think using your own namespace would be better to avoid confusion, especially when errors occur.

In addition, be careful when using the mod_ruby Apache module. Because mod_ruby shares Ruby interpreters among different scripts, modifying an existing class or module under mod_ruby is not recommended unless you understand what you are doing.
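
A minimal sketch of the "hold the report as an object" alternative, in plain Ruby. FakeHit and FakeReport are stand-ins so that the example runs without BioRuby; BlatAnalyzer, its coverage formula (matched bases divided by query length), and the field names are assumptions for illustration only, not part of Bio::Blat::Report:

```ruby
require 'forwardable'

# Hypothetical stand-ins for a parsed report and its hits, so that the
# sketch runs without BioRuby installed.
FakeHit    = Struct.new(:query_id, :match, :query_len)
FakeReport = Struct.new(:hits)

# Wraps a report object instead of inheriting from its class, so private
# methods and instance variables of the wrapped class cannot clash with ours.
class BlatAnalyzer
  extend Forwardable
  def_delegators :@report, :hits   # expose only what is needed

  def initialize(report)
    @report = report
  end

  # Assumed definition: fraction of the query covered by matching bases.
  def coverage(hit)
    hit.match.to_f / hit.query_len
  end

  # Hits ordered from highest to lowest coverage.
  def sort_by_coverage
    hits.sort_by { |hit| -coverage(hit) }
  end
end

report   = FakeReport.new([FakeHit.new('query1', 50, 100),
                           FakeHit.new('query1', 90, 100)])
analyzer = BlatAnalyzer.new(report)
analyzer.sort_by_coverage.each do |hit|
  puts "#{hit.query_id}\t#{analyzer.coverage(hit)}"
end
```

With this layout the wrapper depends only on the public methods it delegates to, so internal changes in the wrapped class are less likely to break it.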
> > Actually the structure of my blatanalyzer.rb application is this one > > class Color > # to handle colorized output (use term-ansicolor) > end > > class OptParser > # parse command line options > end > > module Bio > class Blat > > class Analyzer < Report > # extend the functionality of the Report with sorting, > grouping and other methods > end > > class Application > # load a stream, check options, select action and execute it > printing result on STDOUT > end > > end > end > > # MAIN.APP > # slurp command line options and start application > options = OptParser.parse(ARGV) > Bio::Blat::Application.new(options,ARGF) > > > Something I need to change? make sense? In your application, you can do whatever you want to do. What I write here is only an empirical suggestion. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 11:48:07 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Wed, 3 Sep 2008 17:48:07 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> Hi again sorry for all this e-mails, I notice a change in the reporter object (add_line method) after commit: http://github.com/bioruby/bioruby/commit/ 88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd + # Adds a line to the entry if the given line is regarded as + # a part of the current entry. + # If the current entry (self) is empty, or the line has the same + # query name, the line is added and returns self. + # Otherwise, returns false (the line is not added). + def add_line(line) + if /\A\s*\z/ =~ line then + return @hits.empty? ? self : false + end + hit = Hit.new(line.chomp) + if @hits.empty? 
or @hits.first.query.name == hit.query.name then + @hits.push hit + return self + else + return false + end end So now if there are more than one query_id in the input file it will be automatically splitted in different reports right? That's cool (I have developed a method in my blat analyzer to group hits by id that I can remove now). the only point I see: what append with an input with line swapped? I don't believe is a common case anyway: blat psl results are ordered by query name but can happend if you change the order of psl lines. consider this script: #!/usr/local/bin/ruby -w require 'bio' Bio::FlatFile.open(Bio::Blat::Report,ARGF).each do |report| puts "object id: " + report.object_id.to_s + " hits: " + report.hits.size.to_s + " query name:" + report.query_id end Before the commit it give only one object, and (as reported in doc) only the first query name. now with this test file: -------------- next part -------------- 3 lines of psl output with 3 different query name: output: object id: 277400 hits: 1 query name:query1 object id: 274620 hits: 1 query name:query2 object id: 271910 hits: 1 query name:query3 But if with a psl file like this one: -------------- next part -------------- Where we have 3 query names (2 hits each) and lines are not in order: object id: 277400 hits: 1 query name:query1 object id: 274620 hits: 1 query name:query2 object id: 272010 hits: 1 query name:query1 object id: 269350 hits: 1 query name:query3 object id: 266640 hits: 1 query name:query2 object id: 263930 hits: 1 query name:query3 f I sort the lines again by query name: -------------- next part -------------- object id: 277400 hits: 2 query name:query1 object id: 273590 hits: 2 query name:query2 object id: 269800 hits: 2 query name:query3 So it doesn't work if you have unsorted lines (but I guess is faster). Sorry for my bad english and for this long mail. best regards Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Wed Sep 3 23:52:56 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 12:52:56 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> Message-ID: <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> On Wed, 3 Sep 2008 17:48:07 +0200 Davide Rambaldi wrote: > Hi again sorry for all this e-mails, > > I notice a change in the reporter object (add_line method) after commit: > http://github.com/bioruby/bioruby/commit/ > 88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd > > + # Adds a line to the entry if the given line is regarded as > + # a part of the current entry. > + # If the current entry (self) is empty, or the line has the same > + # query name, the line is added and returns self. > + # Otherwise, returns false (the line is not added). > + def add_line(line) > + if /\A\s*\z/ =~ line then > + return @hits.empty? ? self : false > + end > + hit = Hit.new(line.chomp) > + if @hits.empty? or @hits.first.query.name == hit.query.name > then > + @hits.push hit > + return self > + else > + return false > + end > end > > > So now if there are more than one query_id in the input file it will > be automatically splitted in different reports right? Yes, in combination with Bio::FlatFile. 
The behavior was changed after this commit:
http://github.com/bioruby/bioruby/commit/88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd

This is somewhat incompatible, but better for speed and memory usage. In addition, some people requested it.
http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html
(Mailing list written in Japanese)

Note that this can give wrong results for data in which different query sequences with the same name appear consecutively.

> That's cool (I have developed a method in my blat analyzer to group
> hits by id that I can remove now).
>
> the only point I see: what append with an input with line swapped?
> I don't believe is a common case anyway: blat psl results are ordered
> by query name
> but can happend if you change the order of psl lines.

When the parser detects a change of query name, a new report object is started. Note that the Bio::Blat::Report parser only supports files directly generated by the blat program, without post-modification. What happens with modified data is at your own risk.

> consider this script:
>
> #!/usr/local/bin/ruby -w
> require 'bio'
>
> Bio::FlatFile.open(Bio::Blat::Report,ARGF).each do |report|
>   puts "object id: " + report.object_id.to_s + " hits: " +
> report.hits.size.to_s + " query name:" + report.query_id
> end
>
> Before the commit it give only one object, and (as reported in doc)
> only the first query name.
>
> now with this test file:

If you really want the old behavior,

  str = File.read(filename)
  obj = Bio::Blat::Report.new(str)

then obj is a single Bio::Blat::Report object that may contain multiple queries.
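
For input whose lines are no longer in blat's original order, one possible workaround is to regroup the lines by query name before parsing. A minimal sketch in plain Ruby, assuming headerless PSL with 21 tab-separated columns and the query name in the 10th column; group_psl_by_query and dummy_psl_line are hypothetical helpers, not BioRuby methods:

```ruby
# Hypothetical helper: collects headerless PSL lines by query name,
# regardless of the order in which the lines appear.
# Assumption: 21 tab-separated columns, query name (qName) in column 10.
def group_psl_by_query(text)
  groups = Hash.new { |hash, name| hash[name] = [] }
  text.each_line do |line|
    fields = line.chomp.split("\t")
    next if fields.size < 21          # skip blank or malformed lines
    groups[fields[9]] << line.chomp
  end
  groups
end

# Builds a dummy PSL line for the demonstration (all values are made up).
def dummy_psl_line(query_name)
  fields = Array.new(21, '0')
  fields[8]  = '+'
  fields[9]  = query_name
  fields[13] = 'chr1'
  fields.join("\t")
end

# Same pattern as the unsorted test case: three queries, two hits each.
unsorted = %w[query1 query2 query1 query3 query2 query3]
text = unsorted.map { |name| dummy_psl_line(name) }.join("\n")

group_psl_by_query(text).each do |name, lines|
  puts "query name: #{name} hits: #{lines.size}"
end
```

Each group of lines could then be joined with newlines and passed to Bio::Blat::Report.new, at the cost of holding the whole file in memory.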
-- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Thu Sep 4 05:11:54 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Thu, 4 Sep 2008 11:11:54 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > This is somehow incompatible, but good at speed and memory usage. > In addition, some people requested. > http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html > (Mailing list written in Japanese) ehm... any good translator from japanese to english (or better italian!) ? :P anyway I am agree that the strange case of mixed hits can be ignored. This commits will be available in the next version of bioruby? I have bioruby on the edge in my laptop but not on the cluster... Last question (sorry for asking everything), there is a way to generate docs of boiruby that can be queried with the ri command? ri Bio::Blat::Report Nothing known about Bio::Blat::Report Thanks! Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 07:36:28 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 20:36:28 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> Message-ID: <20080904113629.834C61CBC5D5@idnmail.gen-info.osaka-u.ac.jp> On Thu, 4 Sep 2008 11:11:54 +0200 Davide Rambaldi wrote: > > On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > > > This is somehow incompatible, but good at speed and memory usage. > > In addition, some people requested. > > http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html > > (Mailing list written in Japanese) > > > ehm... any good translator from japanese to english (or better > italian!) ? :P Google or Yahoo can be used. Be careful they frequently mistranslate. http://www.google.com/translate_t http://babelfish.yahoo.com/ > anyway I am agree that the strange case of mixed hits can be ignored. > > This commits will be available in the next version of bioruby? Yes. > > I have bioruby on the edge in my laptop but not on the cluster... > > Last question (sorry for asking everything), there is a way to > generate docs of boiruby that can be queried with the ri command? 
> > ri Bio::Blat::Report > Nothing known about Bio::Blat::Report I don't know about ri, and I hope someone can answer. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From kpatil at science.uva.nl Thu Sep 4 08:02:19 2008 From: kpatil at science.uva.nl (K. Patil) Date: Thu, 4 Sep 2008 14:02:19 +0200 (CEST) Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> Message-ID: <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> Hi, I am trying to do some simple processing on fasta files. It works file for small files (upto several MB). But as soon as I move to very large files (e.g. 2.2 GB) the program crashes. Any help/suggestions highly appreciated. 
Best regards, Kaustubh Patil I am pasting a very simple example below (the file is 2.2GB); irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") => #, @buffer="", @path="9606.2.fna">, @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", @path="9606.2.fna">, @header=nil, @delimiter="\n>", @delimiter_overrun=1>, @firsttime_flag=true, @stream=#, @buffer="", @path="9606.2.fna">, @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, @dbclass=Bio::FastaFormat> irb(main):022:0> fasta.each do |seq| irb(main):023:1* print seq.data irb(main):024:1> end NoMethodError: private method `sub' called for nil:NilClass from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' from (irb):22 From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 09:01:59 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 22:01:59 +0900 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> Message-ID: <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> Hi, Please show which BioRuby version, Ruby version, OS, architecture (type of CPU) you are using. Is the Ruby and/or BioRuby version older? 
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From kpatil at science.uva.nl Thu Sep 4 09:32:27 2008 From: kpatil at science.uva.nl (K.
Patil)
Date: Thu, 4 Sep 2008 15:32:27 +0200 (CEST)
Subject: [BioRuby] problem while handling large fasta files
In-Reply-To: <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp>
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl>

Oops, sorry for incomplete information. Here it is;

Ruby: 1.8
Bioruby: 1.0.0
OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 GNU/Linux

Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate permissions (all packages are installed by the administrator on request).

thanks and regards,
kaustubh

From sgujja at broad.mit.edu Thu Sep 4 10:53:11 2008
From: sgujja at broad.mit.edu (Sharvari Gujja)
Date: Thu, 04 Sep 2008 10:53:11 -0400
Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion..
Message-ID: <48BFF657.1080302@broad.mit.edu>

Hi,

I am trying to convert a multi fasta sequence file (nucleotide/protein) to genbank format. Is there a way to do this using Bioruby?

Appreciate any input/suggestions.
Thanks S From adamnkraut at gmail.com Thu Sep 4 19:13:25 2008 From: adamnkraut at gmail.com (Adam Kraut) Date: Thu, 4 Sep 2008 19:13:25 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <48BFF657.1080302@broad.mit.edu> References: <48BFF657.1080302@broad.mit.edu> Message-ID: <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> I've never used the genbank format, but in Bioruby you could try: include Bio fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) fasta.entries.each do |seq| puts seq.to_seq.output(:genbank) end The only tricky part is perhaps is the to_seq call for a Bio::Sequence object which has different output format methods. -Adam On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote: > Hi, > > I am trying to convert a multi fasta sequence file (nucleotide/protein) to > genbank format.Is there a way to do this using Bioruby? > > Appreciate any input/suggestions. > > Thanks > S > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 21:34:26 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Fri, 5 Sep 2008 10:34:26 +0900 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> Message-ID: <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> Hi, On Thu, 4 Sep 2008 19:13:25 -0400 "Adam Kraut" wrote: > I've never used the genbank format, but in Bioruby you could try: > > include Bio > > fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end No need to use Bio::Alignment::MultiFastaFormat in this case. Bio::FlatFile alone can do. 
For example, to read from stdin and write to stdout:

  require 'bio'
  Bio::FlatFile.open($<) do |ff|
    ff.each do |e|
      print e.to_biosequence.output(:genbank)
    end
  end

Note that output(:genbank) is a new feature, available only in
the latest development version in the git repository.
http://github.com/bioruby/bioruby
(i.e. the above examples cannot be run in BioRuby 1.2.1.)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> The only tricky part is perhaps is the to_seq call for a Bio::Sequence
> object which has different output format methods.
> -Adam
>
> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote:
>
> > Hi,
> >
> > I am trying to convert a multi fasta sequence file (nucleotide/protein) to
> > genbank format.Is there a way to do this using Bioruby?
> >
> > Appreciate any input/suggestions.
> >
> > Thanks
> > S
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 21:47:21 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 5 Sep 2008 10:47:21 +0900
Subject: [BioRuby] problem while handling large fasta files
In-Reply-To: <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl>
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it>
 <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp>
 <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it>
 <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp>
 <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
 <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl>
 <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp>
 <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl>
Message-ID:
<20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp>

On Thu, 4 Sep 2008 15:32:27 +0200 (CEST)
"K. Patil" wrote:

> Oops, sorry for incomplete information. Here it is;
>
> Ruby: 1.8
> Bioruby: 1.0.0
> OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64
> GNU/Linux

BioRuby 1.0.0 is too old!

The only thing I can say is that the problem may not occur
in a recent version of BioRuby (1.2.1 or later).

> Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate
> permissions (all packages are installed by the administrator on request).

BioRuby (and also Ruby) can be installed in your home directory,
without root (administrator) permission.

The simplest way is:

  % cd somewhere
  % wget http://bioruby.open-bio.org/archive/bioruby-1.2.1.tar.gz
  % tar zxvf bioruby-1.2.1.tar.gz

And then, when running your script,

  % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib example.rb

(The "/full/path/to/somewhere" is the path where you extracted
the bioruby archive.)

If you want to use irb,

  % irb -I /full/path/to/somewhere/bioruby-1.2.1/lib -r bio

Alternatively, put

  $LOAD_PATH.unshift("/full/path/to/somewhere/bioruby-1.2.1/lib")

before the require 'bio' in your script.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

>
> thanks and regards,
> kaustubh
>
>
> > Hi,
> >
> > Please show which BioRuby version, Ruby version, OS,
> > architecture (type of CPU) you are using.
> >
> > Is the Ruby and/or BioRuby version older?
> >
> > Naohisa Goto
> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >
> > On Thu, 4 Sep 2008 14:02:19 +0200 (CEST)
> > "K. Patil" wrote:
> >
> >> Hi,
> >>
> >> I am trying to do some simple processing on fasta files. It works file
> >> for
> >> small files (upto several MB). But as soon as I move to very large files
> >> (e.g. 2.2 GB) the program crashes. Any help/suggestions highly
> >> appreciated.
> >> > >> Best regards, > >> Kaustubh Patil > >> > >> I am pasting a very simple example below (the file is 2.2GB); > >> > >> irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") > >> => # >> @splitter=# >> @stream=# >> @io=# >> @io=#, @buffer="", @path="9606.2.fna">, > >> @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > >> @path="9606.2.fna">, @header=nil, @delimiter="\n>", > >> @delimiter_overrun=1>, > >> @firsttime_flag=true, > >> @stream=# >> @io=# >> @io=#, @buffer="", @path="9606.2.fna">, > >> @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > >> @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, > >> @dbclass=Bio::FastaFormat> > >> irb(main):022:0> fasta.each do |seq| > >> irb(main):023:1* print seq.data > >> irb(main):024:1> end > >> NoMethodError: private method `sub' called for nil:NilClass > >> from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' > >> from (irb):22 > >> > >> > >> _______________________________________________ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 05:21:02 2008 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Fri, 5 Sep 2008 18:21:02 +0900 Subject: [BioRuby] GFF attributes Message-ID: Hi, When extracting attributes from a GFF file, older implementation seem to have eat the last character before ";". Current, (downloaded very recently from github), does not split well, as the regular expression search the largest match. A patch is included, but I am not sure on the specification. 
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

The specification says:
> From version 2 onwards, the attribute field must have an tag value
> structure following the syntax used within objects in a .ace file,
> flattened onto one line by semicolon separators. Tags must be
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> be quoted with double quotes. Note: all non-printing characters in
> such free text value strings (e.g. newlines, tabs, control
> characters, etc) must be explicitly represented by their C (UNIX)
> style backslash-escaped representation (e.g. newlines as '\n', tabs
> as '\t').

So, it seems that for proper parsing, double-quoted free text should
be detected: a semicolon inside such a quotation is not a separator
for attributes, and a semicolon may not be preceded by a backslash.

Anyway, the file I am looking at now is not that complex,
and I will go with a quick hack at this time.

Best regards,
Tomoaki

the test program
$ cat test-gff.rb
#!/usr/local/bin/ruby
require 'bio'
gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
Bio::GFF.new(gff_str).records.each do |fr|
  p fr
end

output after patch
$ /usr/local/bin/ruby test-gff.rb
#"\"grail3.0116000101\"", "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

output from current
#"\"grail3.0116000101\"; proteinId 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

older output
#"\"grail3.0116000101", "proteinId"=>"63957", "exonNumber"=>"3"}>

diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
@@ -122,7 +122,7 @@
     def parse_attributes(attributes)
       hash = Hash.new
       scanner = StringScanner.new(attributes)
-      while scanner.scan(/(.*[^\\])\;/)
 or scanner.scan(/(.+)/)
+      while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
         key, value = scanner[1].split(' ', 2)
         key.strip!
         value.strip! if value

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 05:25:30 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 5 Sep 2008 18:25:30 +0900
Subject: [BioRuby] Bio::Blat::Report
In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it>
 <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp>
 <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it>
 <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp>
 <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
Message-ID:

Hi,

> ehm... any good translator from japanese to english (or better
> italian!) ? :P

Here is a translation by the original sender:

-- start of translation
I am Nishiyama at Kanazawa.

When a multifasta file is used as queries, blat (unlike blast)
does not output a header per query, but instead
outputs the query and target id on each line.

Bio::Blat::Report, in accordance with that
behavior, seems to return one entry with many
hits. However, as a user, I do not want to search with a separate
file for each query, while I do want the results aggregated for
each query, for example to get the best hit location for each query.

Although there is no separator in the output of blat, the results
for the same query come consecutively.
When processing as a FlatFile, it would be useful
to return a block of lines with the same query name as one "entry",
so I made a "flatfile_splitter".
Because each line is parsed to determine the split position,
the return value is an Array of Hit, so that Hit.new
need not be called again. (This makes about a 20% difference in speed.)
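The grouping idea described above can be sketched in plain Ruby, without BioRuby. This is only an illustration under stated assumptions, not the actual flatfile_splitter from the patch: the helper names each_query_block and psl_line are made up for this example, the input is assumed to be headerless PSL (query name in the 10th tab-separated column), and lines for the same query are assumed to be consecutive.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): yields [query_name, lines]
# for each run of consecutive PSL lines that share a query name.
# Assumes headerless PSL with qName in the 10th tab-separated column.
def each_query_block(io)
  return enum_for(:each_query_block, io) unless block_given?
  current = nil
  lines = []
  io.each_line do |line|
    line = line.chomp
    next if line.empty?
    qname = line.split("\t")[9]
    if current && qname != current
      yield current, lines   # hand over the finished block
      lines = []             # start a fresh array for the next query
    end
    current = qname
    lines << line
  end
  yield current, lines unless lines.empty?
end

# Builds a dummy 21-column PSL line; only qName and tName are meaningful.
def psl_line(qname, tname)
  fields = Array.new(21, "0")
  fields[9] = qname
  fields[13] = tname
  fields.join("\t")
end

sample = [psl_line("q1", "chr1"),
          psl_line("q1", "chr2"),
          psl_line("q2", "chr1")].join("\n") + "\n"

each_query_block(StringIO.new(sample)) do |qname, lines|
  puts "#{qname}: #{lines.size} hit(s)"
end
```

Only one query's lines are held at a time, which is where the memory saving over a whole-file Hash would come from.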
When processing a psl file of 100-200 Mbytes, more than several Gbytes of memory were required with a system reading the whole data into a Hash and processing the hits for each query, but with this system much smaller memory is sufficient. What do you think? -- end of translation The remainder are the diff of the source code. Note that the name of class and file are changed to avoid collision and the behavior of the original class is not changed. On 2008/09/04, at 18:11, Davide Rambaldi wrote: > > On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > >> This is somehow incompatible, but good at speed and memory usage. >> In addition, some people requested. >> http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/ >> 000137.html >> (Mailing list written in Japanese) > > > ehm... any good translator from japanese to english (or better > italian!) ? :P > > anyway I am agree that the strange case of mixed hits can be ignored. > > This commits will be available in the next version of bioruby? > > I have bioruby on the edge in my laptop but not on the cluster... > > Last question (sorry for asking everything), there is a way to > generate docs of boiruby that can be queried with the ri command? > > ri Bio::Blat::Report > Nothing known about Bio::Blat::Report > > > Thanks! > > Davide Rambaldi, > Bioinformatics PhD student. 
> -----------------------------------------------------
> Bioinformatic Group IFOM-IEO Campus
> Via Adamello 16, Milano
> I-20139 Italy
>
> [t] +39 02574303 066
> [e] davide.rambaldi at ifom-ieo-campus.it
> [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/
> DavideRambaldi (homepage)
> [i] http://www.semm.it (PhD school)
> [i] http://www.btbs.unimib.it/ (Master)
>
> -----------------------------------------------------
>
>
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From davide.rambaldi at ifom-ieo-campus.it Fri Sep 5 06:47:06 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Fri, 5 Sep 2008 12:47:06 +0200
Subject: [BioRuby] Bio::Blat::Report
In-Reply-To:
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it>
 <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp>
 <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it>
 <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp>
 <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
Message-ID: <17FB5E9F-4D32-4F75-89EC-FF1E0BE1A24F@ifom-ieo-campus.it>

On Sep 5, 2008, at 11:25 AM, Tomoaki NISHIYAMA wrote:

> Hi,
>
>> ehm... any good translator from japanese to english (or better
>> italian!) ? :P
>
> Here is a translation by the original sender:
>

Dear Nishiyama,

thanks for the translation. To follow the discussion:

I agree, the splitter works well and is fast (creating a hash can be a
problem with big files). I am grouping queries in my script (in bioruby
1.2.1, not the latest git release) with group_by and query.name, which
returns a Hash as you say.

Also, for my sorting operations (sorting by score, coverage, identity,
etc.) it is better to work on a small array with only the hits related
to one query.
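For comparison, the in-memory approach mentioned above (group_by on the query name, then sorting inside each small per-query array) might look roughly like this. It is a sketch only: the Hit struct is a stand-in invented for this example, not Bio::Blat::Report::Hit, and its field names (query, target, score) are assumptions.

```ruby
# Stand-in for a parsed hit; NOT the BioRuby class.
# The field names are assumptions made for this example.
Hit = Struct.new(:query, :target, :score)

hits = [
  Hit.new("q1", "chr1", 120),
  Hit.new("q2", "chr3", 80),
  Hit.new("q1", "chr2", 300),
]

# group_by builds a Hash of query name => Array of hits,
# so the whole data set is held in memory at once.
by_query = hits.group_by { |h| h.query }

# Sorting then happens on each small per-query array.
best = {}
by_query.each do |query, group|
  best[query] = group.sort_by { |h| -h.score }.first
end

best.each { |query, hit| puts "#{query}\t#{hit.target}\t#{hit.score}" }
```

This is convenient for small inputs; for large PSL files the Hash is the part that grows with the file size.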
Soon I will put somewhere the code for my blatanalyzer.... (ruby version), any suggestion on where to put it? thanks for the kindly translation Davide > -- start of translation > I am Nishiyama at Kanazawa. > > When a multifasta file is used as queries, unlike blast, > blat does not output a header, but instead > outputs the query and target id in each line. > > Bio::Blat::Report, in accordance with that > behavior, seems to return one entry with many > hits. However, as a user, searching with a split file for each query > is undesired, while the results is desired to be aggregated for > each query. > For example when you want the best hit location for each query. > > Although, there is no separator in the output of blat, the result > for the same query comes continuously. > When processing as a FlatFile, it would be useful > to return a block with the same query name as an "entry", > I made "flatfile_splitter". > Because each line is parsed for determination of split positioin, > return value were made as an Array of Hit, so that Hit.new > need not be called again. (For the speed this would about 20% > difference.) > > When processing a psl file of 100-200 Mbytes, more than several > Gbytes of > memory were required with a system reading the whole data into > a Hash and processing the hits for each query, > but with this system much smaller memory is sufficient. > > What do you think? > > -- end of translation > > The remainder are the diff of the source code. > Note that the name of class and file are changed to avoid collision > and the > behavior of the original class is not changed. > > On 2008/09/04, at 18:11, Davide Rambaldi wrote: > >> >> On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: >> >>> This is somehow incompatible, but good at speed and memory usage. >>> In addition, some people requested. >>> http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/ >>> 000137.html >>> (Mailing list written in Japanese) >> >> >> ehm... 
any good translator from japanese to english (or better >> italian!) ? :P >> >> anyway I am agree that the strange case of mixed hits can be ignored. >> >> This commits will be available in the next version of bioruby? >> >> I have bioruby on the edge in my laptop but not on the cluster... >> >> Last question (sorry for asking everything), there is a way to >> generate docs of boiruby that can be queried with the ri command? >> >> ri Bio::Blat::Report >> Nothing known about Bio::Blat::Report >> >> >> Thanks! >> >> Davide Rambaldi, >> Bioinformatics PhD student. >> ----------------------------------------------------- >> Bioinformatic Group IFOM-IEO Campus >> Via Adamello 16, Milano >> I-20139 Italy >> >> [t] +39 02574303 066 >> [e] davide.rambaldi at ifom-ieo-campus.it >> [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/ >> DavideRambaldi (homepage) >> [i] http://www.semm.it (PhD school) >> [i] http://www.btbs.unimib.it/ (Master) >> >> ----------------------------------------------------- >> >> >> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From donttrustben at gmail.com Fri Sep 5 09:12:18 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Fri, 5 Sep 2008 23:12:18 +1000 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl> <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Or you could use the RUBYLIB environment variable - set it to your bioruby lib/ directory and then you don't have to modify your scripts at all. The advantage of doing this is that your choice of gem/github bioruby version doesn't impact your scripts at all, and so when you change it is much easier. 2008/9/5 Naohisa GOTO > On Thu, 4 Sep 2008 15:32:27 +0200 (CEST) > "K. Patil" wrote: > > > Oops, sorry for incomplete information. Here it is; > > > > Ruby: 1.8 > > Bioruby: 1.0.0 > > OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 > > GNU/Linux > > The BioRuby 1.0.0 is too old! > > The only thing I can say is the problem may not occur > in the latest version of BioRuby, at least 1.2.1. 
> > > Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate > > permissions (all packages are installed by the administrator on request). > > BioRuby (and also Ruby) can be installed in your home directory, > without root (administrator) permission. > > The simplest way is: > > % cd somewhere > % wget http://bioruby.open-bio.org/archive/bioruby-1.2.1.tar.gz > % tar zxvf bioruby-1.2.1.tar.gz > > And then, when running your script, > > % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib example.rb > (The "/full/path/to/somewhere" is the path you extracted > the bioruby archive.) > > If you want to use irb, > > % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib -r bio > > Alternatively, put > > $LOAD_PATH.unshift("/full/path/to/somewhere/bioruby-1.2.1/lib") > > before the require 'bio' in your script. > > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > > > thanks and regards, > > kaustubh > > > > > > > Hi, > > > > > > Please show which BioRuby version, Ruby version, OS, > > > architecture (type of CPU) you are using. > > > > > > Is the Ruby and/or BioRuby version older? > > > > > > Naohisa Goto > > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > > > On Thu, 4 Sep 2008 14:02:19 +0200 (CEST) > > > "K. Patil" wrote: > > > > > >> Hi, > > >> > > >> I am trying to do some simple processing on fasta files. It works file > > >> for > > >> small files (upto several MB). But as soon as I move to very large > files > > >> (e.g. 2.2 GB) the program crashes. Any help/suggestions highly > > >> appreciated. 
> > >> > > >> Best regards, > > >> Kaustubh Patil > > >> > > >> I am pasting a very simple example below (the file is 2.2GB); > > >> > > >> irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") > > >> => # > >> @splitter=# > >> @stream=# > >> @io=# > >> @io=#, @buffer="", @path="9606.2.fna">, > > >> > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > > >> @path="9606.2.fna">, @header=nil, @delimiter="\n>", > > >> @delimiter_overrun=1>, > > >> @firsttime_flag=true, > > >> @stream=# > >> @io=# > >> @io=#, @buffer="", @path="9606.2.fna">, > > >> > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > > >> @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, > > >> @dbclass=Bio::FastaFormat> > > >> irb(main):022:0> fasta.each do |seq| > > >> irb(main):023:1* print seq.data > > >> irb(main):024:1> end > > >> NoMethodError: private method `sub' called for nil:NilClass > > >> from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' > > >> from (irb):22 > > >> > > >> > > >> _______________________________________________ > > >> BioRuby mailing list > > >> BioRuby at lists.open-bio.org > > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > > > > > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. From sgujja at broad.mit.edu Fri Sep 5 10:22:57 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Fri, 05 Sep 2008 10:22:57 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. 
In-Reply-To: <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <48C140C1.1010905@broad.mit.edu> Hi, Thank you so much for the reply. However, I get the following error on running this code: *require 'bio' Bio::FlatFile.open($<) do |ff| ff.each do |e| print e.to_biosequence.output(:genbank) end end* undefined method `to_biosequence' for # (NoMethodError) And running this code gives me: *include Bio fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) fasta.entries.each do |seq| puts seq.to_seq.output(:genbank) end* uninitialized constant Alignment (NameError)... I guess this is something to do with rubygems. Also, I believe this would generate a genbank file for each sequence in the multi-fasta file. Is there a way to get single Genbank file for the multi-fasta sequence file? Appreciate all the help. Thanks S Naohisa GOTO wrote: > Hi, > > On Thu, 4 Sep 2008 19:13:25 -0400 > "Adam Kraut" wrote: > > >> I've never used the genbank format, but in Bioruby you could try: >> >> include Bio >> >> fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) >> fasta.entries.each do |seq| >> puts seq.to_seq.output(:genbank) >> end >> > > No need to use Bio::Alignment::MultiFastaFormat in this case. > Bio::FlatFile alone can do. > > For example, to read from stdin and output to stdout, > > require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end > > Note that the output(:genbank) are new feature only in > the latest development version in the git repository. > http://github.com/bioruby/bioruby > (i.e. in BioRuby 1.2.1, above examples cannot be run.) 
> > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > >> The only tricky part is perhaps is the to_seq call for a Bio::Sequence >> object which has different output format methods. >> -Adam >> >> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote: >> >> >>> Hi, >>> >>> I am trying to convert a multi fasta sequence file (nucleotide/protein) to >>> genbank format.Is there a way to do this using Bioruby? >>> >>> Appreciate any input/suggestions. >>> >>> Thanks >>> S >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >>> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > > > From adamnkraut at gmail.com Fri Sep 5 11:00:31 2008 From: adamnkraut at gmail.com (Adam Kraut) Date: Fri, 5 Sep 2008 11:00:31 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <48C140C1.1010905@broad.mit.edu> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> <48C140C1.1010905@broad.mit.edu> Message-ID: <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> Naohisa, thanks for clearing that up. I knew there was a better way ;) Sharvari, which version of Bioruby have you installed? Both examples will print everything to stdout, which you can redirect to a single file. On Fri, Sep 5, 2008 at 10:22 AM, Sharvari Gujja wrote: > Hi, > > Thank you so much for the reply. 
> > However, I get the following error on running this code: > > *require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end* > > > undefined method `to_biosequence' for # > (NoMethodError) > > And running this code gives me: > > *include Bio > > fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end* > > uninitialized constant Alignment (NameError)... > > I guess this is something to do with rubygems. > > Also, I believe this would generate a genbank file for each sequence in the > multi-fasta file. Is there a way to get single Genbank file for the > multi-fasta sequence file? > > Appreciate all the help. > > Thanks > S > > > Naohisa GOTO wrote: > >> Hi, >> >> On Thu, 4 Sep 2008 19:13:25 -0400 >> "Adam Kraut" wrote: >> >> >> >>> I've never used the genbank format, but in Bioruby you could try: >>> >>> include Bio >>> >>> fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) >>> fasta.entries.each do |seq| >>> puts seq.to_seq.output(:genbank) >>> end >>> >>> >> >> No need to use Bio::Alignment::MultiFastaFormat in this case. >> Bio::FlatFile alone can do. >> >> For example, to read from stdin and output to stdout, >> >> require 'bio' >> Bio::FlatFile.open($<) do |ff| >> ff.each do |e| >> print e.to_biosequence.output(:genbank) >> end >> end >> >> Note that the output(:genbank) are new feature only in >> the latest development version in the git repository. >> http://github.com/bioruby/bioruby >> (i.e. in BioRuby 1.2.1, above examples cannot be run.) >> >> Naohisa Goto >> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org >> >> >> >>> The only tricky part is perhaps is the to_seq call for a Bio::Sequence >>> object which has different output format methods. 
>>> -Adam >>> >>> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja >> >wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I am trying to convert a multi fasta sequence file (nucleotide/protein) >>>> to >>>> genbank format.Is there a way to do this using Bioruby? >>>> >>>> Appreciate any input/suggestions. >>>> >>>> Thanks >>>> S >>>> _______________________________________________ >>>> BioRuby mailing list >>>> BioRuby at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioruby >>>> >>>> >>>> >>>> >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >>> >> >> >> >> >> > From sgujja at broad.mit.edu Fri Sep 5 11:11:36 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Fri, 05 Sep 2008 11:11:36 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> <48C140C1.1010905@broad.mit.edu> <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> Message-ID: <48C14C28.2000903@broad.mit.edu> Hi Adam, I am using bioruby version is 1.2.1. How can I upgrade to the new version? Also,the final output file would contain genbank format for each fasta sequence right? I am interested in getting a single genabank file for all the sequences. Thanks S Adam Kraut wrote: > Naohisa, thanks for clearing that up. I knew there was a better way ;) > > Sharvari, which version of Bioruby have you installed? Both examples > will print everything to stdout, which you can redirect to a single file. > > On Fri, Sep 5, 2008 at 10:22 AM, Sharvari Gujja > wrote: > > Hi, > > Thank you so much for the reply. 
> > However, I get the following error on running this code: > > *require 'bio' > > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end* > > > undefined method `to_biosequence' for > # (NoMethodError) > > And running this code gives me: > > *include Bio > > > fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end* > > uninitialized constant Alignment (NameError)... > > I guess this is something to do with rubygems. > > Also, I believe this would generate a genbank file for each > sequence in the multi-fasta file. Is there a way to get single > Genbank file for the multi-fasta sequence file? > > Appreciate all the help. > > Thanks > S > > > Naohisa GOTO wrote: > > Hi, > > On Thu, 4 Sep 2008 19:13:25 -0400 > "Adam Kraut" > wrote: > > > > I've never used the genbank format, but in Bioruby you > could try: > > include Bio > > fasta = > Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end > > > > No need to use Bio::Alignment::MultiFastaFormat in this case. > Bio::FlatFile alone can do. > > For example, to read from stdin and output to stdout, > > require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end > > Note that the output(:genbank) are new feature only in > the latest development version in the git repository. > http://github.com/bioruby/bioruby > (i.e. in BioRuby 1.2.1, above examples cannot be run.) > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp > / ng at bioruby.org > > > > > The only tricky part is perhaps is the to_seq call for a > Bio::Sequence > object which has different output format methods. 
> -Adam > > On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja > >wrote: > > > > Hi, > > I am trying to convert a multi fasta sequence file > (nucleotide/protein) to > genbank format.Is there a way to do this using Bioruby? > > Appreciate any input/suggestions. > > Thanks > S > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > > From raoul.bonnal at itb.cnr.it Fri Sep 5 03:42:30 2008 From: raoul.bonnal at itb.cnr.it (Raoul Jean Pierre Bonnal) Date: Fri, 05 Sep 2008 09:42:30 +0200 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl> <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <1220600550.7632.3.camel@454-2> Il giorno ven, 05/09/2008 alle 10.47 +0900, Naohisa GOTO ha scritto: > On Thu, 4 Sep 2008 15:32:27 +0200 (CEST) > "K. Patil" wrote: > > > Oops, sorry for incomplete information. Here it is; > > > > Ruby: 1.8 > > Bioruby: 1.0.0 > > OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 > > GNU/Linux > > The BioRuby 1.0.0 is too old! 
and use the latest Ruby release; I had some problems handling huge data with 1.8.6.

--
Ra

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 02:43:05 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 5 Sep 2008 15:43:05 +0900
Subject: [BioRuby] GFF attributes
Message-ID:

Hi,

When extracting attributes from a GFF file,
the older implementation seems to have eaten the last character before ";".
The current one (downloaded very recently from github) does not split well,
because the regular expression is greedy and takes the longest match.

A patch is included, but I am not sure about the specification.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
The specification says:
> From version 2 onwards, the attribute field must have an tag value
> structure following the syntax used within objects in a .ace file,
> flattened onto one line by semicolon separators. Tags must be
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> be quoted with double quotes. Note: all non-printing characters in
> such free text value strings (e.g. newlines, tabs, control
> characters, etc) must be explicitly represented by their C (UNIX)
> style backslash-escaped representation (e.g. newlines as '\n', tabs
> as '\t').

So, it seems that for proper parsing, double-quoted free text
has to be detected, a semicolon inside such a quotation is not a separator
for attributes, and a semicolon may not be preceded by a backslash.

Anyway, the file I am looking at now is not that complex,
and I will go with a quick hack at this time.
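
For reference, the quote-aware splitting described above could look roughly like the sketch below. This is only an illustration in plain Ruby, not the patch attached to this mail and not the code in BioRuby; the method name split_gff2_attributes is hypothetical, and it assumes that a backslash inside double quotes escapes the next character.

```ruby
require 'strscan'

# Hypothetical helper (not part of BioRuby): split a GFF2 attribute field
# into a Hash, treating semicolons inside double quotes as ordinary text.
def split_gff2_attributes(field)
  scanner = StringScanner.new(field)
  hash = {}
  until scanner.eos?
    # One attribute is a run of quoted strings and of characters that are
    # neither a double quote nor a semicolon.
    chunk = scanner.scan(/(?:"(?:[^"\\]|\\.)*"|[^";])+/)
    separator = scanner.scan(/;/)
    if chunk.nil? && separator.nil?
      # Unbalanced quote: take the rest of the field as one chunk
      # so that the loop always terminates.
      chunk = scanner.rest
      scanner.terminate
    end
    next if chunk.nil? || chunk.strip.empty?
    key, value = chunk.strip.split(' ', 2)
    hash[key] = value
  end
  hash
end

attrs = split_gff2_attributes('name "grail3.0116000101"; proteinId 639579; exonNumber 3')
attrs.each { |key, value| puts "#{key} => #{value}" }
```

The quoted value keeps its surrounding double quotes here; whether they should be stripped is a separate design decision.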
Best regards, Tomoaki the test program $ cat test-gff.rb #!/usr/local/bin/ruby require 'bio' gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n" Bio::GFF.new(gff_str).records.each do |fr| p fr end output after patch $ /usr/local/bin/ruby test-gff.rb #"\"grail3.0116000101\"", "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> output from current #"\"grail3.0116000101\"; proteinId 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> older output #"\"grail3.0116000101", "proteinId"=>"63957", "exonNumber"=>"3"}> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ db/gff.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900 @@ -122,7 +122,7 @@ def parse_attributes(attributes) hash = Hash.new scanner = StringScanner.new(attributes) - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/) + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/ (.+)/) key, value = scanner[1].split(' ', 2) key.strip! value.strip! if value -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From davide.rambaldi at ifom-ieo-campus.it Mon Sep 8 08:31:45 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Mon, 8 Sep 2008 14:31:45 +0200 Subject: [BioRuby] blatanalyzer.rb Message-ID: <606898F4-9458-422A-9E42-ECC852BD7967@ifom-ieo-campus.it> Hi, I have published a first version of my command-line "application" that use BioRuby: blatanalyzer at http://rubyforge.org/projects/ blatanalyzer/. Blatanalyzer is a software to analize the output of blat software alignment (PSL files): list query names,sort by identity, coverage, score, span. 
convert to: gff, gtf formats
generate: report tables, PSL, GFF and GTF files

Available Actions: gff, list, cut, duplicates, gtf, report, singletons, table, summary

gff, gtf: conversion to gff, gtf
list: generate a list of query names
cut: extract psl alignments over/under a given threshold (identity, span, coverage, score)
report, table, summary: generate pretty reports; table is like the web-blat output table, report is a custom table with coverage and span, summary prints a list of query names with the number of alignments and the number of distinct target chromosomes.

more actions coming...

It is basically composed of:

- an OptionParser class
- a Bio::Blat::Application class (implements the actions)
- a Bio::Blat::Analyzer class (subclass of Bio::Blat::Report)

Any suggestion is really appreciated!

Davide Rambaldi, Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From pjotr2008 at thebird.nl Tue Sep 9 07:38:16 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Tue, 9 Sep 2008 13:38:16 +0200
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080902091958.GA31400@thebird.nl>
References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl>
Message-ID: <20080909113816.GA10051@thebird.nl>

I wrote a simple file Cache Singleton.
See: http://github.com/pjotrp/bioruby/tree/462614487767568f41db03d894875a3d78ced08e/lib/bio/db/microarray/cache.rb The Cache can be read and set with: dir = Bio::Microarray::Cache.instance.directory('GEO') # override cache dir dir = Bio::Microarray::Cache.instance.set(newcachedir,'GEO') Everyone OK with this? Pj. On Tue, Sep 02, 2008 at 11:19:58AM +0200, Pjotr Prins wrote: > > Note that some classes use Tempfile class, a standard bundled > > class with Ruby by default, and the Tempfile class depends > > on enviroment variables (TMPDIR, TMP, etc.). > > I noticed. Caching is a bit different in nature - as caches may be > there for a long time. TMPDIRs get emptied on reboot, for one. > > > I think cache isn't suitable for standard, because its purpose > > may differ from program (or class, module, etc.) to program. > > > For example, if I want to put class A's cache on a fast hard disk > > with very large size, and program B's cache on a slower hard disk > > with small size, what should I do? > > That is true. OK, leave caching for the modules to resolve. I'll use > my own caching of GEO XML objects. From ngoto at gen-info.osaka-u.ac.jp Tue Sep 9 07:47:46 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 9 Sep 2008 20:47:46 +0900 Subject: [BioRuby] GFF attributes In-Reply-To: References: Message-ID: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 5 Sep 2008 15:43:05 +0900 Tomoaki NISHIYAMA wrote: > Hi, > > When extracting attributes from a GFF file, > older implementation seem to have eat the last character before ";". > Current, (downloaded very recently from github), does not split well, > as the regular expression search the largest match. Thank you for reporting a bug. > A patch is included, but I am not sure on the specification. 
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml > The specification says: > > From version 2 onwards, the attribute field must have an tag value > > structure following the syntax used within objects in a .ace file, > > flattened onto one line by semicolon separators. Tags must be > > standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must > > be quoted with double quotes. Note: all non-printing characters in > > such free text value strings (e.g. newlines, tabs, control > > characters, etc) must be explicitly represented by their C (UNIX) > > style backslash-escaped representation (e.g. newlines as '\n', tabs > > as '\t'). I also see BioPerl's _from_gff2_string in Bio::Tools::GFF http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/GFF.html#CODE10 It seems is still has bugs (as described in comments in their code), but semicolons inside double quotes are treated as normal letters and not separators for attributes. > So, it seems that for proper parsing, quotation with double quote > should be checked for free text, > and semicolon in that quatation is not a separator > for attributes and semicolon may not be preceeded with back slash. I've changed to do so. This means the patch was not used. http://github.com/ngoto/bioruby/commit/e38fd48aaf41f94eaec39a639a7f6c5db62c22e8 (This is my repository. Because the change seems severe, I'll push to the main bioruby repository later, after checking more and more.) To prevent repeating the bug, I want to use the GFF string described in your mail for the test script in BioRuby. (test/unit/bio/db/test_gff.rb) Can you give permission? Best regards, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Anyway, the file I am looking now is not that complex, > and I will go with a quick hack at this time. 
> > Best regards, > > Tomoaki > > the test program > $ cat test-gff.rb > #!/usr/local/bin/ruby > require 'bio' > gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname > \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n" > Bio::GFF.new(gff_str).records.each do |fr| > p fr > end > > output after patch > $ /usr/local/bin/ruby test-gff.rb > # @comments=nil, @strand="-", @feature="CDS", @score=".", > @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"", > "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", > @seqname="LG_I"> > > output from current > # @comments=nil, @strand="-", @feature="CDS", @score=".", > @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId > 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> > > older output > # @frame="0", @start="11052", @comments=nil, @strand="-", > @feature="CDS", @score=".", @source="JGI", @attributes= > {"name"=>"\"grail3.0116000101", "proteinId"=>"63957", > "exonNumber"=>"3"}> > > diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ > bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb > --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ > db/gff.rb 2008-09-03 22:24:39.000000000 +0900 > +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900 > @@ -122,7 +122,7 @@ > def parse_attributes(attributes) > hash = Hash.new > scanner = StringScanner.new(attributes) > - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/) > + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/ > (.+)/) > key, value = scanner[1].split(' ', 2) > key.strip! > value.strip! 
if value > > > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ngoto at gen-info.osaka-u.ac.jp Tue Sep 9 21:48:20 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 10 Sep 2008 10:48:20 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080909113816.GA10051@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> Message-ID: <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> Hi, I think the most important thing for cache is data integrity. For example, timing for detecting updates of original data, controlling accesses and resolving race conditions (two or more processes or threads simultaneously want to use, update, create, and/or remove the same cache data). However, your code only contains directory name determination. line 24: > def set directory, subdir = nil In def lines, please use parentheses explicitly, e.g. def set(directory, subdir = nil), because most of existing code in BioRuby does so. line 28: > dir = dir + '/' + subdir File.join(dir, subdir) should be used, possibly to support non-UNIX systems like Windows. lines 41 to 45: > if cache==nil or cache=='' > cache = ENV['TMPDIR'] > end > cache = '/tmp' if cache==nil or cache=='' > set cache, subdir Using Dir.tmpdir defined in tempdir.rb is better. http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 9 Sep 2008 13:38:16 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > I wrote a simple file Cache Singleton. 
See: > > http://github.com/pjotrp/bioruby/tree/462614487767568f41db03d894875a3d78ced08e/lib/bio/db/microarray/cache.rb > > The Cache can be read and set with: > > dir = Bio::Microarray::Cache.instance.directory('GEO') > # override cache dir > dir = Bio::Microarray::Cache.instance.set(newcachedir,'GEO') > > Everyone OK with this? > > Pj. > > On Tue, Sep 02, 2008 at 11:19:58AM +0200, Pjotr Prins wrote: > > > Note that some classes use Tempfile class, a standard bundled > > > class with Ruby by default, and the Tempfile class depends > > > on enviroment variables (TMPDIR, TMP, etc.). > > > > I noticed. Caching is a bit different in nature - as caches may be > > there for a long time. TMPDIRs get emptied on reboot, for one. > > > > > I think cache isn't suitable for standard, because its purpose > > > may differ from program (or class, module, etc.) to program. > > > > > For example, if I want to put class A's cache on a fast hard disk > > > with very large size, and program B's cache on a slower hard disk > > > with small size, what should I do? > > > > That is true. OK, leave caching for the modules to resolve. I'll use > > my own caching of GEO XML objects. From pjotr2008 at thebird.nl Wed Sep 10 03:48:58 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 10 Sep 2008 09:48:58 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080910074858.GA16861@thebird.nl> Hi Naohisa, Thanks for comments. See below. On Wed, Sep 10, 2008 at 10:48:20AM +0900, Naohisa GOTO wrote: > Hi, > > I think the most important thing for cache is data integrity. 
> For example, timing for detecting updates of original data,
> controlling accesses and resolving race conditions
> (two or more processes or threads simultaneously want to
> use, update, create, and/or remove the same cache data).
> However, your code only contains directory name determination.

Well, caching is a universal term for storing stuff intermediately.
And what I need is a place to put files. With regard to race
conditions you are right - if two processes were to download the same
file it would get mangled. However, since the files are XML, the
program would throw an error on parsing. For me that works well enough.

For BioRuby we may need to think of something more universal - and it
is not that hard to do. That is why I wrote my earlier mail. If you
want to support something universal it should be at a higher point in
the source tree. But maybe leave it until someone gets an itch to
scratch.

> line 24:
> > def set directory, subdir = nil
>
> In def lines, please use parentheses explicitly,
> e.g. def set(directory, subdir = nil),
> because most of existing code in BioRuby does so.

I like the 'most'. But OK.

> line 28:
> > dir = dir + '/' + subdir
>
> File.join(dir, subdir) should be used, possibly to support
> non-UNIX systems like Windows.

OK

> lines 41 to 45:
> > if cache==nil or cache==''
> >   cache = ENV['TMPDIR']
> > end
> > cache = '/tmp' if cache==nil or cache==''
> > set cache, subdir
>
> Using Dir.tmpdir defined in tmpdir.rb is better.
> http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html

Thanks,

Pj.
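
Taken together, the three review points (parentheses in def, File.join, Dir.tmpdir) might be applied roughly as in the sketch below. It is a stand-alone illustration only: the method name cache_directory is hypothetical and is not the actual Bio::Microarray::Cache code in the linked repository.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the directory lookup discussed above.
# Falls back to the system temporary directory when no base is given.
def cache_directory(base = nil, subdir = nil)
  base = Dir.tmpdir if base.nil? || base.empty?
  # File.join uses the right separator instead of a hard-coded '/'.
  subdir ? File.join(base, subdir) : base
end

puts cache_directory(nil, 'GEO')
puts cache_directory('/var/cache/bioruby', 'GEO')
```

Note that Dir.tmpdir already consults TMPDIR and related environment variables, so the explicit ENV lookup and the '/tmp' fallback become unnecessary.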
From pjotr2008 at thebird.nl Wed Sep 10 06:27:10 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Wed, 10 Sep 2008 12:27:10 +0200
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080910074858.GA16861@thebird.nl>
References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl>
Message-ID: <20080910102710.GA18178@thebird.nl>

I have made available for testing Bio::Microarray support for Affy and
GEO XML and MINiML formats. The next step will be support for RMA and
quantile normalisation. See:

  http://github.com/pjotrp/bioruby/tree/master

  http://github.com/pjotrp/bioruby/tree/master/lib/bio/db/microarray

  git://github.com/pjotrp/bioruby.git

Enjoy,

Pj.

From pjotr2008 at thebird.nl Wed Sep 10 06:36:45 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Wed, 10 Sep 2008 12:36:45 +0200
Subject: [BioRuby] Introducing microarray support in BioRuby
Message-ID: <20080910103645.GA18598@thebird.nl>

Sorry, should have used a different Subject.

On Wed, Sep 10, 2008 at 12:27:10PM +0200, Pjotr Prins wrote:
> I have made available for testing Bio::Microarray support for Affy and
> GEO XML and MINiML formats. The next step will be support for RMA and
> quantile normalisation. See:
>
>   http://github.com/pjotrp/bioruby/tree/master
>
>   http://github.com/pjotrp/bioruby/tree/master/lib/bio/db/microarray
>
>   git://github.com/pjotrp/bioruby.git
>
> Enjoy,
>
> Pj.
> _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From tomoakin at kenroku.kanazawa-u.ac.jp Wed Sep 10 21:51:43 2008 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Thu, 11 Sep 2008 10:51:43 +0900 Subject: [BioRuby] Translate ambiguous sequence Message-ID: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp> Hi, Bioruby's translate any codon containing ambiguity code to unknown or "X". However, sometimes, it is desirable to translate into a fixed amino acid when it is possible. tty -> "F" seeing the core implementation being naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} changing unknown to ct.translate_ambiguity(codon, unknown) will not hurt the performance for sequence without ambiguity, and trying to resolve degenerate codons is worth to do. Also, the sequence in GenBank is usually translated as such. What do you think? diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/codontable.rb bioruby-c/lib/bio/data/codontable.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-c/lib/bio/data/codontable.rb 2008-09-11 09:49:23.000000000 +0900 @@ -93,6 +93,23 @@ def [](codon) @table[codon] end + def translate_ambiguity(codon, unknown = 'X') + triplet = codon + "NNN" + aa = nil + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third| + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do| first| + Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do| second| + if aa == nil + aa = @table[first+second+third] + elsif + aa != @table[first+second+third] + return unknown + end + end + end + end + aa + end # Modify the codon table. Use with caution as it may break hard coded # tables. 
If you want to modify existing table, you should use copy diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/na.rb bioruby-c/lib/bio/data/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/na.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-c/lib/bio/data/na.rb 2008-09-11 09:26:00.000000000 +0900 @@ -182,6 +182,13 @@ end Regexp.new(str) end + def ambiguity2individual(na, rna = false) + str = NAMES[na.downcase].gsub(/[\[\]]/,"") + if rna + str.tr!("t", "u") + end + str.split(//) + end end diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/sequence/na.rb bioruby-c/lib/bio/sequence/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ sequence/na.rb2008-09-03 22:24:39.000000000 +0900 +++ bioruby-c/lib/bio/sequence/na.rb 2008-09-11 09:48:52.000000000 +0900 @@ -252,7 +252,7 @@ end nalen = naseq.length - from nalen -= nalen % 3 - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} + aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} return Bio::Sequence::AA.new(aaseq) end -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From tomoakin at kenroku.kanazawa-u.ac.jp Wed Sep 10 22:34:36 2008 From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA) Date: Thu, 11 Sep 2008 11:34:36 +0900 Subject: [BioRuby] GFF attributes In-Reply-To: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Hi > To prevent repeating the bug, I want to use the GFF string > described in your mail for the test script in BioRuby. > (test/unit/bio/db/test_gff.rb) > Can you give permission? Surely, I have no objection. The string is one of the line in the Popular genome annotation from the JGI site. 
ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/ Poptr1_1.JamboreeModels.gff.gz So, I think acknowledging them is a good idea. For test string, I think another pattern including multiple value for one key is worth to add. The example from http://www.sanger.ac.uk/Software/formats/GFF/ GFF_Spec.shtml: seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003 Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the value for 'Target'. But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more sensible, or represent more of the meaning of the specification. Since changing this return value will make incompatibilities, I'm not sure whether it can be changed. But if it is ever to be changed, it is better changed early, or stated as such. If it is too late, perhaps we can make a method under a different name so that currently working code will not be affected. -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan On 2008/09/09, at 20:47, Naohisa GOTO wrote: > Hi, > > On Fri, 5 Sep 2008 15:43:05 +0900 > Tomoaki NISHIYAMA wrote: > >> Hi, >> >> When extracting attributes from a GFF file, >> older implementation seem to have eat the last character before ";". >> Current, (downloaded very recently from github), does not split well, >> as the regular expression search the largest match. > > Thank you for reporting a bug. > >> A patch is included, but I am not sure on the specification. >> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml >> The specification says: >>> From version 2 onwards, the attribute field must have an tag value >>> structure following the syntax used within objects in a .ace file, >>> flattened onto one line by semicolon separators. Tags must be >>> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must >>> be quoted with double quotes. Note: all non-printing characters in >>> such free text value strings (e.g. 
newlines, tabs, control >>> characters, etc) must be explicitly represented by their C (UNIX) >>> style backslash-escaped representation (e.g. newlines as '\n', tabs >>> as '\t'). > > I also see BioPerl's _from_gff2_string in Bio::Tools::GFF > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/ > Tools/GFF.html#CODE10 > It seems is still has bugs (as described in comments in their code), > but semicolons inside double quotes are treated as normal letters > and not separators for attributes. > >> So, it seems that for proper parsing, quotation with double quote >> should be checked for free text, >> and semicolon in that quatation is not a separator >> for attributes and semicolon may not be preceeded with back slash. > > I've changed to do so. This means the patch was not used. > > http://github.com/ngoto/bioruby/commit/ > e38fd48aaf41f94eaec39a639a7f6c5db62c22e8 > (This is my repository. Because the change seems severe, > I'll push to the main bioruby repository later, > after checking more and more.) > > To prevent repeating the bug, I want to use the GFF string > described in your mail for the test script in BioRuby. > (test/unit/bio/db/test_gff.rb) > Can you give permission? > > Best regards, > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > >> >> Anyway, the file I am looking now is not that complex, >> and I will go with a quick hack at this time. 
>> >> Best regards, >> >> Tomoaki >> >> the test program >> $ cat test-gff.rb >> #!/usr/local/bin/ruby >> require 'bio' >> gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname >> \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n" >> Bio::GFF.new(gff_str).records.each do |fr| >> p fr >> end >> >> output after patch >> $ /usr/local/bin/ruby test-gff.rb >> #> @comments=nil, @strand="-", @feature="CDS", @score=".", >> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"", >> "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", >> @seqname="LG_I"> >> >> output from current >> #> @comments=nil, @strand="-", @feature="CDS", @score=".", >> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId >> 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> >> >> older output >> #> @frame="0", @start="11052", @comments=nil, @strand="-", >> @feature="CDS", @score=".", @source="JGI", @attributes= >> {"name"=>"\"grail3.0116000101", "proteinId"=>"63957", >> "exonNumber"=>"3"}> >> >> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/ >> lib/ >> bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb >> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ >> db/gff.rb 2008-09-03 22:24:39.000000000 +0900 >> +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900 >> @@ -122,7 +122,7 @@ >> def parse_attributes(attributes) >> hash = Hash.new >> scanner = StringScanner.new(attributes) >> - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/) >> + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/ >> (.+)/) >> key, value = scanner[1].split(' ', 2) >> key.strip! >> value.strip! 
if value
>>
>>
>> --
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp Mon Sep 15 06:08:56 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 15 Sep 2008 19:08:56 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi,

To make translation more compatible with what is done between DNA entries and protein
entries in the databases, I think that special treatment of the start codon and of
incomplete codons is necessary.

Special treatment of start codons is for codons that are
translated to M only when used as the start codon, and to
a different amino acid when used as an internal codon within a CDS.
For example, GUG is V if it is internal to the CDS, but it can also serve
as a start codon, and in that case it encodes M.
To change the behavior, I think an option is required.

Incomplete codons are seen at the end of an incomplete CDS, presumably due to
the cloning or sequencing strategy.
When there is 'cg' at the end of a CDS, it is translated to 'R',
since any third nucleotide would make the codon translate as 'R'.

It seems the translation is added only if the amino acid can be specified and is not 'X'.
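
To illustrate the rule independently of the patch, here is a self-contained sketch in plain Ruby. It does not use BioRuby; the names TABLE, IUPAC and translate_ambiguous are hypothetical, and only the standard genetic code (table 1) is assumed. A codon, padded with 'n' if incomplete, yields an amino acid only when every possible expansion agrees.

```ruby
# Standard genetic code (table 1), built from the usual TCAG ordering.
BASES = %w[t c a g]
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
TABLE = {}
BASES.each_with_index do |b1, i|
  BASES.each_with_index do |b2, j|
    BASES.each_with_index do |b3, k|
      TABLE[b1 + b2 + b3] = AMINO[16 * i + 4 * j + k, 1]
    end
  end
end

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't', 'u' => 't',
  'r' => 'ag', 'y' => 'ct', 'w' => 'at', 's' => 'cg', 'm' => 'ac',
  'k' => 'gt', 'b' => 'cgt', 'd' => 'agt', 'h' => 'act', 'v' => 'acg',
  'n' => 'acgt'
}

# Hypothetical helper: translate one (possibly ambiguous or incomplete)
# codon; return 'unknown' unless all expansions give the same amino acid.
def translate_ambiguous(codon, unknown = 'X')
  padded = (codon.downcase + 'nnn')[0, 3]
  choices = padded.chars.map { |c| (IUPAC[c] || '').chars }
  return unknown if choices.any?(&:empty?)
  first, second, third = choices
  aas = first.product(second, third).map { |triplet| TABLE[triplet.join] }.uniq
  aas.size == 1 ? aas.first : unknown
end

%w[tty cg mgr ttn gug].each do |codon|
  puts "#{codon} -> #{translate_ambiguous(codon)}"
end
```

The start-codon case (GUG read as M at the first position) is deliberately left out of this sketch, since it depends on the check_start option proposed in the patch.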
-- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13 12:06:28.000000000 +0900 @@ -93,6 +93,23 @@ def [](codon) @table[codon] end + def translate_ambiguity(codon, unknown = 'X') + triplet = codon + "NNN" + aa = nil + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third| + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do| first| + Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do| second| + if aa == nil + aa = @table[first+second+third] + elsif + aa != @table[first+second+third] + return unknown + end + end + end + end + aa + end # Modify the codon table. Use with caution as it may break hard coded # tables. 
If you want to modify existing table, you should use copy diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/na.rb bioruby-a/lib/bio/data/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/na.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/data/na.rb 2008-09-13 12:06:28.000000000 +0900 @@ -182,6 +182,13 @@ end Regexp.new(str) end + def ambiguity2individual(na, rna = false) + str = NAMES[na.downcase].gsub(/[\[\]]/,"") + if rna + str.tr!("t", "u") + end + str.split(//) + end end diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ sequence/na.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15 18:57:19.000000000 +0900 @@ -231,7 +231,7 @@ # (default 1) # * (optional) _unknown_: Character (default 'X') # *Returns*:: Bio::Sequence::AA object - def translate(frame = 1, table = 1, unknown = 'X') + def translate(frame = 1, table = 1, unknown = 'X', check_start = false) if table.is_a?(Bio::CodonTable) ct = table else @@ -251,8 +251,19 @@ from = 0 end nalen = naseq.length - from - nalen -= nalen % 3 - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} +# nalen -= nalen % 3 + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) + if nalen > 3 + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} + else + aaseq = "M" + end + else + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} + end + if nalen % 3 != 0 + aaseq.sub!(/X$/,"") + end return Bio::Sequence::AA.new(aaseq) end From ktym at hgc.jp Mon Sep 15 08:12:52 2008 From: ktym at hgc.jp (Toshiaki Katayama) Date: Mon, 15 Sep 2008 21:12:52 +0900 Subject: [BioRuby] Translate ambiguous sequence In-Reply-To: References: 
<9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
Message-ID: <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>

Hi,

* check_start

As you suggested, the codon table object (Bio::CodonTable) holds
a list of start codons, but the Bio::Sequence::NA#translate
method does not use it (the same is true for the stop codons).

lib/bio/data/codontable.rb:
------------------------------------------------------------
  # Create your own codon table by giving a Hash table of codons and relevant
  # amino acids. You can also able to define the table's name as a second
  # argument.
  #
  # Two Arrays 'start' and 'stop' can be specified which contains a list of
  # start and stop codons used by 'start_codon?' and 'stop_codon?' methods.
  def initialize(hash, definition = nil, start = [], stop = [])
    @table = hash
    @definition = definition
    @start = start
    @stop = stop.empty? ? generate_stop : stop
  end
------------------------------------------------------------

So, the following code of yours should be included in some way
(though I would prefer to set check_start = true by default, and to use
an explicit 'first_codon' variable instead of naseq[0, 3]).

------------------------------------------------------------
+ if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
------------------------------------------------------------

* ambiguity

As for the ambiguity, your need seems to be restricted to
the 3' end of the sequence, but there may also be demand for
translating 'n's within the sequence.

Since Bio::Sequence::NA#translate accepts a codon table object
of your own as the 2nd argument, and you can copy and override
the default codon tables (#1 to #23; or you can define your own
codon table from scratch), another approach would be
to define the ambiguous translations yourself.

------------------------------------------------------------
your_codon_table = Bio::CodonTable.copy(1)
your_codon_table['cgn'] = 'R'
your_codon_table['cg'] = 'R'

aaseq = naseq.translate(frame, your_codon_table)
------------------------------------------------------------

To do this, we only need to change the following lines

lib/bio/sequence/na.rb (translate):
------------------------------------------------------------
    nalen -= nalen % 3
    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

to the following

------------------------------------------------------------
    #nalen -= nalen % 3
    aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

though perhaps with a toggle flag to enable or disable this feature.

Regards,
Toshiaki Katayama

From tomoakin at kenroku.kanazawa-u.ac.jp Mon Sep 15 23:15:19 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 16 Sep 2008 12:15:19 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
Message-ID: <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>

Hi,

Thank you for the comments.

> (but I prefer to set check_start = true by default;

It was set to false by default only to avoid changing the existing
behavior; making it true is fine with me.
If changing the interface is allowed,
I would prefer that unknown become a later option, since
changing the unknown character from 'X' is expected to be very rare
and, in fact, can be done with a simple gsub without
the help of the library.

> As for the ambiguity, your needs seems to be restricted
> only for the 3' end of the sequence, but there may be demands
> for translating 'n's in the sequence.

My need is not restricted to the 3' end, nor to
'N's: there are ten other IUPAC ambiguity codes.
The message of September 11 dealt only with those situations
(where the whole triplet is given but contains an ambiguity code)
and did not consider the start codon or the translation of the
last 2 bases at the 3' end.

I agree that adding every ambiguous codon with a determinate translation
to the codon tables is another way to resolve the ambiguity codes.
But the table would be quite large if it had to support all possible
combinations for all the tables (at least for human review),
and a generator would have to be written.
Expecting that sequences containing ambiguity are rare, I wrote the code so that
it does not affect the efficiency of translating sequences without ambiguity.
The code for ambiguity is apparently quite expensive, but I do not expect
sequences to contain so many ambiguity codes that this becomes a problem.
(A high proportion of ambiguity in itself is OK if the sequence is not very long.)
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From ktym at hgc.jp Tue Sep 16 00:56:14 2008
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 16 Sep 2008 13:56:14 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
 <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi,

> It was set to false for the default for just not to
> change the default behavior and is ok to make true for me.

I just thought that if the main application of the 'translate'
method is to translate a gene to a protein sequence, the current
implementation is incomplete and should be changed.
If not, retaining the current behavior may be better.

> If the change of the interface is allowed,
> I prefer that the unknown be later option, since
> changing the unknown from 'X' is expected to be very rare,
> and, in fact, it can be done just a gsub operation without
> the help of the library.

I can agree (I don't know what others think, though).
Another option is to provide different methods (interfaces)
for considering start/stop codons and ambiguous bases.
Or introduce named options...

> My need is not restricted to the 3' end, and also not restricted to
> 'N's but there are ten other IUPAC redundant codes.

Sorry, I misunderstood your code.

You are trying to translate all possible combinations of the ambiguous
bases on the fly.

Your code is fine, and the following is just for discussion:

Is there no efficient way to statically generate a reduction of
the given codon table considering ambiguous bases...?
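
For illustration only, the kind of static reduction asked about here could be sketched in plain Ruby as below. This is not BioRuby code and not anything proposed in the thread: the names IUPAC, STANDARD, expand_codon and build_ambiguity_table are hypothetical, the sketch handles DNA letters only, and it assumes the standard genetic code (NCBI table 1) written out as the usual 64-character string.

```ruby
# Sketch only: plain Ruby, no BioRuby needed. Every name below is hypothetical.

# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't',
  'r' => 'ag', 'y' => 'ct', 'w' => 'at', 's' => 'cg',
  'k' => 'gt', 'm' => 'ac', 'b' => 'cgt', 'd' => 'agt',
  'h' => 'act', 'v' => 'acg', 'n' => 'acgt'
}.freeze

# Standard genetic code (NCBI table 1) in the usual t, c, a, g order.
BASES = %w[t c a g].freeze
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'.freeze

STANDARD = {}
BASES.each_with_index do |b1, i|
  BASES.each_with_index do |b2, j|
    BASES.each_with_index do |b3, k|
      STANDARD[b1 + b2 + b3] = AMINO[16 * i + 4 * j + k, 1]
    end
  end
end

# Expand a possibly ambiguous, possibly incomplete codon into concrete
# triplets. Missing trailing bases are treated as 'n'.
def expand_codon(codon)
  padded = (codon.downcase + 'nnn')[0, 3]
  first, second, third = padded.split(//).map { |code| IUPAC.fetch(code).split(//) }
  first.product(second, third).map { |bases| bases.join }
end

# Precompute every dinucleotide and triplet over the IUPAC alphabet whose
# expansions all translate to the same amino acid (or all to stop, '*').
def build_ambiguity_table(table)
  codes = IUPAC.keys
  candidates = codes.product(codes) + codes.product(codes, codes)
  reduced = {}
  candidates.each do |parts|
    key = parts.join
    amino_acids = expand_codon(key).map { |codon| table[codon] }.uniq
    reduced[key] = amino_acids.first if amino_acids.size == 1
  end
  reduced
end

AMBIGUITY_TABLE = build_ambiguity_table(STANDARD)

puts AMBIGUITY_TABLE['cgn']       # whole triplet with an ambiguous third base
puts AMBIGUITY_TABLE['cg']        # incomplete codon, e.g. at the 3' end
puts AMBIGUITY_TABLE.key?('nnn')  # undetermined codons are simply absent
```

With such a table, a lookup miss means "unknown", so no per-codon comparison is needed at translation time; the cost moves to building the table once per codon table.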

Your implementation seems to return 'unknown' if a codon containing
ambiguous bases can be translated to different amino acids; however,
the comparison is repeated every time such a codon
is passed to the 'translate_ambiguity' method.

It would be helpful to know how many patterns need to be generated
to match codons with ambiguous bases for the 20 amino acids.

Is it possible to rewrite the current Bio::CodonTable implementation
to use Regexp keys in the codon table hash for this purpose?

Regards,
Toshiaki Katayama

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 16 01:12:31 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 16 Sep 2008 14:12:31 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To:
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
 <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
Message-ID: <20080916051231.E52721CBC4F5@idnmail.gen-info.osaka-u.ac.jp>

On Tue, 16 Sep 2008 13:56:14 +0900
Toshiaki Katayama wrote:

> Hi,
>
> > It was set to false for the default for just not to
> > change the default behavior and is ok to make true for me.
>
> I just thought that if the main application of the 'translate'
> method is to translate gene to protein sequence, current
> implementation is incomplete and should be changed.
> If not, retain the current behavior may be better.

I'm using "translate" not only for whole genes, but also for
partial sequences and/or sequences with unknown start positions.
So, I don't want to change the default.

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Tue Sep 16 02:38:37 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 16 Sep 2008 15:38:37 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To:
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
 <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
Message-ID: <5712116A-2F5D-460D-8557-896A83B2861E@kenroku.kanazawa-u.ac.jp>

Hi,

> Is there no efficient way to statically generate a reduction of
> the given codon table considering ambiguous bases...?
>
> Your implementation seems to return 'unknown' if the translation of
> the codon containing ambiguous bases are translated to the different
> amino acid, however, the comparison occurs every time when the codon
> is passed to the 'translate_ambiguity' method.
>
> It would be helpful to know how many patterns needed to be generated
> to match codons with ambiguous bases for 20 amino acids.

Generating the hash itself is not very difficult
(just iterate over all possible triplets and dinucleotides,
with some assumptions about the table),
and 174-195 keys are sufficient for each of the preexisting codon tables
(for the 20 amino acids plus '*').

The benefit is usually quite low, as there is little ambiguity in
DNA sequences (because low-quality regions are removed at an earlier stage).
The hash might be worth including for the standard codon tables if someone
needs to process a large quantity of poor-quality sequence data directly.
(Maybe 454 or Solexa?)
For codon table objects that are copied and modified, I expect there are few
cases where the cost of generating that table for ambiguity treatment is
smaller than that of the on-the-fly comparison.

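
As a rough, self-contained illustration of the on-the-fly comparison discussed in this thread, the logic can be sketched without BioRuby as follows. It is only a sketch under stated assumptions: IUPAC, STANDARD, resolve_ambiguous and translate_with_ambiguity are hypothetical names rather than library API, only DNA letters are handled, and the table is the standard genetic code (NCBI table 1). Unlike the posted patch, which strips a literal 'X', the sketch strips whatever unknown character was passed in.

```ruby
# Sketch only: plain Ruby, no BioRuby needed. Every name below is hypothetical.

# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't',
  'r' => 'ag', 'y' => 'ct', 'w' => 'at', 's' => 'cg',
  'k' => 'gt', 'm' => 'ac', 'b' => 'cgt', 'd' => 'agt',
  'h' => 'act', 'v' => 'acg', 'n' => 'acgt'
}.freeze

# Standard genetic code (NCBI table 1) in the usual t, c, a, g order.
BASES = %w[t c a g].freeze
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'.freeze

STANDARD = {}
BASES.each_with_index do |b1, i|
  BASES.each_with_index do |b2, j|
    BASES.each_with_index do |b3, k|
      STANDARD[b1 + b2 + b3] = AMINO[16 * i + 4 * j + k, 1]
    end
  end
end

# Resolve one ambiguous or incomplete codon: pad it with 'n', walk through
# its concrete expansions, and give up at the first disagreement.
def resolve_ambiguous(table, codon, unknown = 'X')
  padded = (codon + 'nnn')[0, 3]
  first, second, third = padded.split(//).map { |code| IUPAC.fetch(code).split(//) }
  aa = nil
  first.product(second, third).each do |bases|
    candidate = table[bases.join]
    aa ||= candidate
    return unknown unless candidate == aa
  end
  aa
end

# Translate a DNA string. Unambiguous codons take the plain hash lookup, so
# only codons missing from the table pay for the expansion.
def translate_with_ambiguity(table, seq, unknown = 'X')
  seq = seq.downcase
  protein = seq.scan(/.{1,3}/).map do |codon|
    table[codon] || resolve_ambiguous(table, codon, unknown)
  end.join
  # Drop a trailing unknown that comes from an incomplete last codon.
  protein = protein.chomp(unknown) if seq.length % 3 != 0
  protein
end

puts translate_with_ambiguity(STANDARD, 'atgcgnta')  # trailing 'ta' is undetermined
puts translate_with_ambiguity(STANDARD, 'atgcg')     # trailing 'cg' is always R
```

The first call prints MR because the final 'ta' could be Y or a stop and is dropped; the second also prints MR because every completion of 'cg' encodes R.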
#!/usr/local/bin/ruby
require 'bio'

# The four bases plus the eleven IUPAC ambiguity codes.
dnanucleotides = ['a', 'c', 'g', 't', 'y', 'r', 'w', 's', 'k', 'm', 'b', 'd', 'h', 'v', 'n']
tableary=Array.new
[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23].each do |tableno|
  partialhash = Hash.new
  dnanucleotides.each do |first|
    dnanucleotides.each do |second|
      # Dinucleotide (incomplete codon)
      dnaseq = Bio::Sequence::NA.new(first + second)
      transl = dnaseq.translate(1,tableno)
      if transl != 'X' and transl != ""
        partialhash[dnaseq] = transl
      end
      dnanucleotides.each do |third|
        # Full triplet
        dnaseq = Bio::Sequence::NA.new(first + second + third)
        transl = dnaseq.translate(1,tableno)
        if transl != 'X'
          partialhash[dnaseq] = transl
        end
      end
    end
  end
  # Number of keys that have a determinate translation in this table
  puts "table#{tableno}: #{partialhash.size} patterns"
#  p partialhash
end

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2008/09/16, at 13:56, Toshiaki Katayama wrote:

> Hi,
>
>> It was set to false for the default for just not to
>> change the default behavior and is ok to make true for me.
>
> I just thought that if the main application of the 'translate'
> method is to translate gene to protein sequence, current
> implementation is incomplete and should be changed.
> If not, retain the current behavior may be better.
>
>> If the change of the interface is allowed,
>> I prefer that the unknown be later option, since
>> changing the unknown from 'X' is expected to be very rare,
>> and, in fact, it can be done just a gsub operation without
>> the help of the library.
>
> I can agree (don't know how others think, though).
> Another option is to provide different methods (interfaces)
> for considering start/stop codons and ambiguous bases.
> Or introduce named options...
>
>> My need is not restricted to the 3' end, and also not restricted to
>> 'N's but there are ten other IUPAC redundant codes.
>
> Sorry, I misunderstood your code.
>
> You are trying to translate all possible combinations of the ambiguous
> bases on the fly.
> > Your code is fine and followings are just for discussion: > > Is there no efficient way to statically generate a reduction of > the given codon table considering ambiguous bases...? > > Your implementation seems to return 'unknown' if the translation of > the codon containing ambiguous bases are translated to the different > amino acid, however, the comparison occurs every time when the codon > is passed to the 'translate_ambiguity' method. > > It would be helpful to know how many patterns needed to be generated > to match codons with ambiguous bases for 20 amino acids. > > Is it possible to rewrite current Bio::CodonTable implementation > to utilize Regexp as a key for the codon table hash for this purpose? > > Regards, > Toshiaki Katayama > > > On 2008/09/16, at 12:15, Tomoaki NISHIYAMA wrote: > >> Hi, >> >> Thank you for comments. >>> (but I prefer to set check_start = true by default; >> It was set to false for the default for just not to >> change the default behavior and is ok to make true for me. >> If the change of the interface is allowed, >> I prefer that the unknown be later option, since >> changing the unknown from 'X' is expected to be very rare, >> and, in fact, it can be done just a gsub operation without >> the help of the library. >> >>> As for the ambiguity, your needs seems to be restricted >>> only for the 3' end of the sequence, but there may be demands >>> for translating 'n's in the sequence. >> >> >> My need is not restricted to the 3' end, and also not restricted to >> 'N's but there are ten other IUPAC redundant codes. >> The message on September 11 treated only on these situations >> (where whole triplet is given but contain an ambiguity code) >> but not conscious on the start and the 3' end translation of 2 base. >> >> I agree that addition of all possible redundant determinate codes >> to the codon tables >> is another way to resolve the ambiguity codes. 
>> But the table will be quite large to support all the possible >> combinations for all the tables (at least for human review), >> and a generator should be written. >> Expecting that sequences containing ambiguity is rare, I wrote the >> code that will >> not impact the efficiency of translating sequence without ambiguity. >> Apparently the code for ambiguity is quite expensive, but I do not >> expect translating >> sequences containing so many ambiguity code that is problematic. >> (High proportion of ambiguity in itself is ok if the sequence is >> not very long). >> -- >> Tomoaki NISHIYAMA >> >> Advanced Science Research Center, >> Kanazawa University, >> 13-1 Takara-machi, >> Kanazawa, 920-0934, Japan >> >> >> On 2008/09/15, at 21:12, Toshiaki Katayama wrote: >> >>> Hi, >>> >>> * check_start >>> >>> As you suggested, the codon table object (Bio::CodonTable) holds >>> a list of >>> start codons as a knowledge, but Bio::Sequence::NA#translate >>> method does not >>> utilize it (it is also true for the stop codons). >>> >>> lib/bio/data/codontable.rb: >>> ------------------------------------------------------------ >>> # Create your own codon table by giving a Hash table of codons >>> and relevant >>> # amino acids. You can also able to define the table's name as >>> a second >>> # argument. >>> # >>> # Two Arrays 'start' and 'stop' can be specified which contains >>> a list of >>> # start and stop codons used by 'start_codon?' and 'stop_codon?' >>> methods. >>> def initialize(hash, definition = nil, start = [], stop = []) >>> @table = hash >>> @definition = definition >>> @start = start >>> @stop = stop.empty? ? generate_stop : stop >>> end >>> ------------------------------------------------------------ >>> >>> So, the following your code should be included in someway >>> (but I prefer to set check_start = true by default; and >>> use 'first_codon' variable explicitly instead of naseq[0, 3]). 
>>> >>> ------------------------------------------------------------ >>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >>> ------------------------------------------------------------ >>> >>> >>> * ambiguity >>> >>> As for the ambiguity, your needs seems to be restricted >>> only for the 3' end of the sequence, but there may be demands >>> for translating 'n's in the sequence. >>> >>> As the Bio::Sequence::NA#translate accepts the codon table object >>> of your own as the 2nd argument, and you can copy and override >>> the default codon tables (#1 to #23; or you can define your own >>> codon table from scratch), there may be another approach to define >>> ambiguous translations by your own. >>> >>> ------------------------------------------------------------ >>> your_codon_table = Bio::CodonTable.copy(1) >>> your_codon_table['cgn'] = 'R' >>> your_codon_table['cg'] = 'R' >>> >>> aaseq = naseq.translate(frame, your_codon_table) >>> ------------------------------------------------------------ >>> >>> To do this, we only need to change the following lines >>> >>> lib/bio/sequence/na.rb (translate): >>> ------------------------------------------------------------ >>> nalen -= nalen % 3 >>> aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or >>> unknown} >>> ------------------------------------------------------------ >>> >>> to the below >>> >>> ------------------------------------------------------------ >>> #nalen -= nalen % 3 >>> aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or >>> unknown} >>> ------------------------------------------------------------ >>> >>> but may be with a toggle flag to enable/disable this feature. 
>>> >>> Regards, >>> Toshiaki Katayama >>> >>> >>> >>> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote: >>> >>>> Hi, >>>> >>>> To further make translation compatible what is done between DNA >>>> entry and protein >>>> entry in databases, I thought that special treatment of the >>>> start codon and >>>> incomplete codons are necessary. >>>> >>>> Special treatment of the start codons are for those codons that is >>>> translated to M only when it is used as the start codon and >>>> a different amino acids if it is used as an internal codon >>>> within a CDS. >>>> For example GUG is V if it is internal to the CDS, but it can >>>> also serve >>>> as a start codon and in that case it encodes M. >>>> To change the behavior, I think an option is required. >>>> >>>> Incomplete codons are seen at the end of incomplete CDS, >>>> presumably due to >>>> cloning or sequencing strategy. >>>> When there are 'cg' at the end of CDS that are translated to 'R' >>>> as any nucleotide would make the codon translate as 'R' >>>> >>>> It seems the translation are added only if the amino acid can be >>>> specified and is not 'X'. 
>>>> -- >>>> Tomoaki NISHIYAMA >>>> >>>> Advanced Science Research Center, >>>> Kanazawa University, >>>> 13-1 Takara-machi, >>>> Kanazawa, 920-0934, Japan >>>> >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/ >>>> codontable.rb bioruby-a/lib/bio/data/codontable.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13 >>>> 12:06:28.000000000 +0900 >>>> @@ -93,6 +93,23 @@ >>>> def [](codon) >>>> @table[codon] >>>> end >>>> + def translate_ambiguity(codon, unknown = 'X') >>>> + triplet = codon + "NNN" >>>> + aa = nil >>>> + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each >>>> do|third| >>>> + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each >>>> do|first| >>>> + Bio::NucleicAcid.ambiguity2individual(triplet >>>> [1..1]).each do|second| >>>> + if aa == nil >>>> + aa = @table[first+second+third] >>>> + elsif >>>> + aa != @table[first+second+third] >>>> + return unknown >>>> + end >>>> + end >>>> + end >>>> + end >>>> + aa >>>> + end >>>> >>>> # Modify the codon table. Use with caution as it may break >>>> hard coded >>>> # tables. 
If you want to modify existing table, you should use >>>> copy >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/ >>>> na.rb bioruby-a/lib/bio/data/na.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/data/na.rb 2008-09-13 >>>> 12:06:28.000000000 +0900 >>>> @@ -182,6 +182,13 @@ >>>> end >>>> Regexp.new(str) >>>> end >>>> + def ambiguity2individual(na, rna = false) >>>> + str = NAMES[na.downcase].gsub(/[\[\]]/,"") >>>> + if rna >>>> + str.tr!("t", "u") >>>> + end >>>> + str.split(//) >>>> + end >>>> >>>> end >>>> >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ >>>> sequence/na.rb bioruby-a/lib/bio/sequence/na.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15 >>>> 18:57:19.000000000 +0900 >>>> @@ -231,7 +231,7 @@ >>>> # (default 1) >>>> # * (optional) _unknown_: Character (default 'X') >>>> # *Returns*:: Bio::Sequence::AA object >>>> - def translate(frame = 1, table = 1, unknown = 'X') >>>> + def translate(frame = 1, table = 1, unknown = 'X', >>>> check_start = false) >>>> if table.is_a?(Bio::CodonTable) >>>> ct = table >>>> else >>>> @@ -251,8 +251,19 @@ >>>> from = 0 >>>> end >>>> nalen = naseq.length - from >>>> - nalen -= nalen % 3 >>>> - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] >>>> or unknown} >>>> +# nalen -= nalen % 3 >>>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >>>> + if nalen > 3 >>>> + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {| >>>> codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} >>>> + else >>>> + aaseq = "M" >>>> + end >>>> + else >>>> + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct >>>> [codon] or ct.translate_ambiguity(codon, unknown)} >>>> + end >>>> + if nalen % 
3 != 0 >>>> + aaseq.sub!(/X$/,"") >>>> + end >>>> return Bio::Sequence::AA.new(aaseq) >>>> end >>>> >>>> >>>> _______________________________________________ >>>> BioRuby mailing list >>>> BioRuby at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >>> >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >> > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From sgujja at broad.mit.edu Tue Sep 16 15:34:11 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Tue, 16 Sep 2008 15:34:11 -0400 Subject: [BioRuby] Bio::Blast::RPSBlast::Report Message-ID: <48D00A33.2050906@broad.mit.edu> Hi, Can someone please direct me to Bio::Blast::RPSBlast::Report documentation/examples ? Thanks S From ngoto at gen-info.osaka-u.ac.jp Tue Sep 16 22:44:28 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 17 Sep 2008 11:44:28 +0900 Subject: [BioRuby] Bio::Blast::RPSBlast::Report In-Reply-To: <48D00A33.2050906@broad.mit.edu> References: <48D00A33.2050906@broad.mit.edu> Message-ID: <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp> On Tue, 16 Sep 2008 15:34:11 -0400 Sharvari Gujja wrote: > Hi, > > Can someone please direct me to Bio::Blast::RPSBlast::Report > documentation/examples ? http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html Note that the Bio::Blast::RPSBlast::Report exists still only in development version, and the spec and usage would be changed before the release version in near future. 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 16 23:56:19 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 17 Sep 2008 12:56:19 +0900
Subject: [BioRuby] GFF attributes
In-Reply-To:
References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Thu, 11 Sep 2008 11:34:36 +0900
Tomoaki NISHIYAMA wrote:

> Hi
>
> > To prevent repeating the bug, I want to use the GFF string
> > described in your mail for the test script in BioRuby.
> > (test/unit/bio/db/test_gff.rb)
> > Can you give permission?
>
> Surely, I have no objection.
> The string is one of the lines in the Poplar genome annotation from
> the JGI site.
> ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
> So, I think acknowledging them is a good idea.

Thank you. I'll add the above URL in the comments of the test.

> For test string, I think another pattern including multiple value for
> one key is worth to add.
> The example from http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
> seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
>
> Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the
> value for 'Target'.
> But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> sensible, or represent
> more of the meaning of the specification.

In this case, string escaping and quotation in free text can also be
processed by the class, and [ 'HBA_HUMAN', '11', '55' ] can be returned.

> Since changing this return value will make incompatibilities, I'm not
> sure
> whether it can be changed.
> But if it is ever to be changed, it is better changed early, or
> stated as such.
> If it is too late, perhaps we can make a method under a different
> name so that
> currently working code will not be affected.

Indeed, for GFF2 attributes, I've already found a design problem in
the current Bio::GFF::GFF2#attributes. Currently, a hash is used to store
attributes, but the GFF2 spec allows two or more tags with the same name.
For example,
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature

  Align 101 11 ; Align 179 36 ;

In this case, with the current bioruby implementation, the "Align 101 11"
is overwritten by the latter "Align 179 36", and we can only get
{ "Align" => "179 36" }.

To solve the problem, I can think of the following two ways.

1. Using an Array to store values from multiple tags.
   For example, in the above case,

     @attributes = {}
     @attributes['Align'] = [ '101 11', '179 36' ]
     @attributes['Target'] = '"HBA_HUMAN" 11 54'

   I already took this approach in GFF3 with incompatible changes,
   because the previous implementation of GFF3#attributes was broken
   and could not be used. But now, I think this approach is not good
   and I want to change it, because checking whether the value is an
   array or not is needed every time. In addition, in this case, we
   can not parse '"HBA_HUMAN" 11 54' to [ 'HBA_HUMAN', '11', '54' ],
   because it is impossible to distinguish values from multiple tags
   from parsed values, unless an array is always used.

2. Giving up using a hash, and using an array (or possibly a new class,
   e.g. GFF2::Attributes) of [ tag, value ] pairs. For backward
   compatibility, a hash can be dynamically generated when the old
   behavior is requested.

   I think this approach is better. I'll implement this later.

Any comments and suggestions are welcome.
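
To make the second option concrete, here is a rough sketch of an ordered list of [ tag, value ] pairs with a hash generated on demand. It is only an illustration under assumptions: Gff2AttributeList and its methods are hypothetical names (not the actual or planned BioRuby class), and the parser is simplified by assuming that no ';' appears inside quoted free text.

```ruby
# Illustrative sketch only; Gff2AttributeList is a made-up name, not BioRuby API.
class Gff2AttributeList
  def initialize
    @pairs = []   # ordered [tag, value] pairs; duplicate tags are kept
  end

  # Parse a GFF2 attribute column such as 'Align 101 11 ; Align 179 36 ;'.
  # Simplification: assumes no ';' inside quoted free text.
  def self.parse(str)
    list = new
    str.split(';').each do |field|
      tag, value = field.strip.split(/\s+/, 2)
      list.add(tag, value.to_s) if tag
    end
    list
  end

  def add(tag, value)
    @pairs << [tag, value]
    self
  end

  # All values stored under a tag, in file order.
  def values(tag)
    @pairs.select { |t, _| t == tag }.map { |_, v| v }
  end

  # Old-style view: a Hash in which a later duplicate tag overwrites an earlier one.
  def to_hash
    h = {}
    @pairs.each { |t, v| h[t] = v }
    h
  end
end

attrs = Gff2AttributeList.parse('Align 101 11 ; Align 179 36 ;')
p attrs.values('Align')   # => ["101 11", "179 36"]
p attrs.to_hash           # only the last Align survives, as in the old behavior
```

Because every tag always maps to a list of values, callers never need to check whether a value is an Array, which is the drawback noted for the first option.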
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From sgujja at broad.mit.edu Wed Sep 17 10:18:58 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Wed, 17 Sep 2008 10:18:58 -0400 Subject: [BioRuby] Bio::Blast::RPSBlast::Report In-Reply-To: <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp> References: <48D00A33.2050906@broad.mit.edu> <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <48D111D2.8010404@broad.mit.edu> Hi, Thank you so much for the info. However, on running the code for rpsblast output parser, I get the following error: *uninitialized constant Bio::Blast::RPSBlast (NameError)* I am not sure what exactly I am missing here. I really appreciate all the help. Thanks S Naohisa GOTO wrote: > On Tue, 16 Sep 2008 15:34:11 -0400 > Sharvari Gujja wrote: > > >> Hi, >> >> Can someone please direct me to Bio::Blast::RPSBlast::Report >> documentation/examples ? >> > > http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html > > Note that the Bio::Blast::RPSBlast::Report exists still > only in development version, and the spec and usage > would be changed before the release version in near future. 
> > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Wed Sep 17 23:16:59 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 18 Sep 2008 12:16:59 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080910074858.GA16861@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> Message-ID: <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, If you don't want to implement any access control, using world writable directory like /tmp (comes from ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, because this is vulnerable to a symbolic link attack. About symbolic link attack, please refer documents: http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx (Note that Ruby's standard TempFile has no problem.) When the "cache" directory isn't explicitly specified by user by using the environment variable BIORUBY_CACHE (or command-line options of custom application), doing without cache should be the default. It is also good to raise SecurityError when the specified directory is writable by everyone. On Wed, 10 Sep 2008 09:48:58 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > Thanks for comments. See below. > > On Wed, Sep 10, 2008 at 10:48:20AM +0900, Naohisa GOTO wrote: > > Hi, > > > > I think the most important thing for cache is data integrity. 
> > For example, timing for detecting updates of original data,
> > controlling accesses and resolving race conditions
> > (two or more processes or threads simultaneously want to
> > use, update, create, and/or remove the same cache data).
> > However, your code only contains directory name determination.
>
> Well, caching is a universal term for storing stuff intermediately.
> And what I need is a place to put files. With regard to race
> conditions you are right - if two processes were to download the same
> file it would get mangled. However, them being XML the program would
> throw an error on parsing. For me that works well enough. For BioRuby
> we may need to think of something more universal - and it is not that
> hard to do. That is why I wrote my earlier mail. If you want to
> support something universal it should be at a higher point in the
> source tree.
>
> But maybe leave it until someone gets an itch to scratch.

If the mangled XML happened to be syntactically valid XML, there would be
no obvious error but incorrect data could be obtained. However, for now,
I believe "that works well enough". Please document in RDoc the limitation
of the current implementation under race conditions.

> > line 24:
> > > def set directory, subdir = nil
> >
> > In def lines, please use parentheses explicitly,
> > e.g. def set(directory, subdir = nil),
> > because most of existing code in BioRuby does so.
>
> I like the 'most'. But OK.
>
> > line 28:
> > > dir = dir + '/' + subdir
> >
> > File.join(dir, subdir) should be used, possibly to support
> > non-UNIX systems like Windows.
>
> OK
>
> > lines 41 to 45:
> > > if cache==nil or cache==''
> > > cache = ENV['TMPDIR']
> > > end
> > > cache = '/tmp' if cache==nil or cache==''
> > > set cache, subdir
> >
> > Using Dir.tmpdir defined in tmpdir.rb is better.
> > http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html
>
> Thanks,
>
> Pj.
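
The policy suggested in this message (no cache unless a directory is explicitly given, File.join for paths, SecurityError for a world-writable directory) could look roughly like the sketch below. It is only an illustration: checked_cache_dir is a hypothetical helper, not part of BioRuby or of Pjotr's cache.rb, and it checks nothing beyond the world-writable bit.

```ruby
require 'tmpdir'

# Illustrative sketch only; checked_cache_dir is a made-up helper, not BioRuby API.
# Returns the cache directory path, or nil when no cache was requested.
def checked_cache_dir(base = ENV['BIORUBY_CACHE'], subdir = nil)
  # No explicit setting: run without a cache instead of falling back to /tmp.
  return nil if base.nil? || base.empty?

  dir = subdir ? File.join(base, subdir) : base

  # File.stat raises Errno::ENOENT if the directory does not exist.
  mode = File.stat(dir).mode
  if (mode & 0002) != 0
    raise SecurityError, "cache directory #{dir} is writable by everyone"
  end
  dir
end

# Example: a private temporary directory passes the check.
Dir.mktmpdir do |private_dir|
  File.chmod(0700, private_dir)
  puts checked_cache_dir(private_dir)
end
```

A directory such as /tmp would be rejected by this check, so the symbolic link attack mentioned above cannot be set up by another local user inside the cache directory itself.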
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr2008 at thebird.nl Thu Sep 18 02:32:37 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Thu, 18 Sep 2008 08:32:37 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080918063237.GA17631@thebird.nl> Hi Naohisa, On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote: > Hi Pjotr, > > If you don't want to implement any access control, > using world writable directory like /tmp (comes from > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, > because this is vulnerable to a symbolic link attack. > > About symbolic link attack, please refer documents: > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx > (Note that Ruby's standard TempFile has no problem.) I agree - assuming you are running a webservice for microarrays. > When the "cache" directory isn't explicitly specified > by user by using the environment variable BIORUBY_CACHE > (or command-line options of custom application), > doing without cache should be the default. NCBI won't be happy with that. But if that is what Bioruby wants... It is not only about my own bandwidth ;-). > It is also good to raise SecurityError when the specified > directory is writable by everyone. I'll remove tmpdir - I introduced it because of an earlier mail. Disabling the cache is easy - off course. Another option is to use TmpFiles and keep track of those in a Hash (I'd rather not have large IO objects in memory). 
OK, that is what I'll implement - assuming you want to include the microarray stuff in Bioruby. Pj. From davide.rambaldi at ifom-ieo-campus.it Fri Sep 19 08:49:40 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Fri, 19 Sep 2008 14:49:40 +0200 Subject: [BioRuby] MacRuby Message-ID: May be you already know: MacRuby 0.3 Released with Interface Builder Support By joel at Wed, Sep 17 2008 11:02am |News ? Ruby Inside reports that the nascent MacRuby distribution, an implementation of Ruby 1.9 based on Mac OS X core technologies, has been updated to version 0.3. The most exciting change in this update is the support for Interface Builder and all the Xcode+IB goodness you need to build gorgeous, GUI-based scientific apps for OS X using the ever productive and succinct Ruby language. Also noteworthy is the inclusion of the HotCocoa library, which is somewhat of a domain specific language for working with Cocoa classes from Ruby. Hopefully a number MacRuby + BioRuby mashups will follow on the heels of this exciting development. best regards Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From pjotr2008 at thebird.nl Fri Sep 19 10:05:14 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Fri, 19 Sep 2008 16:05:14 +0200 Subject: [BioRuby] RFC Unit testing large files Message-ID: <20080919140514.GA32740@thebird.nl> For microarray unit tests I have some 30Mb of files. Probably not very nice to put those in the source tree. The options are: 1. Host them in the source tree - huge downloads for everyone. 2. 
Fetch them on demand by the unit tests - takes long time the first time and where do I put them? In a cache directory? 3. Have the unit tests in a separate tree - special purpose testing 4. No unit tests for these I have the same unit tests in the biolib tree - but that is a hassle too. For BioRuby I propose (3). Maybe I ought to solely use the biolib tree for these specific unit tests and have a 'stub' in the Bioruby tree for them. This problem will come back - and keep in mind the free github space is 'only' 100 Mb. Pj. From pjotr2008 at thebird.nl Fri Sep 19 11:29:54 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Fri, 19 Sep 2008 17:29:54 +0200 Subject: [BioRuby] RFC Unit testing large files In-Reply-To: <1221837547.6231.5.camel@454-2> References: <20080919140514.GA32740@thebird.nl> <1221837547.6231.5.camel@454-2> Message-ID: <20080919152954.GA2058@thebird.nl> It is not simply testing some small code. It is to verify, for example, that large files get read properly - and that RMA normalization does its job. Otherwise I would certainly opt for such a solution. Pj. On Fri, Sep 19, 2008 at 05:19:07PM +0200, Raoul Jean Pierre Bonnal wrote: > Il giorno ven, 19/09/2008 alle 16.05 +0200, Pjotr Prins ha scritto: > > For microarray unit tests I have some 30Mb of files. Probably not > > very nice to put those in the source tree. The options are: > > > > 1. Host them in the source tree - huge downloads for everyone. > > > > 2. Fetch them on demand by the unit tests - takes long time the first > > time and where do I put them? In a cache directory? > > > > 3. Have the unit tests in a separate tree - special purpose testing > > > > 4. No unit tests for these > > > > I have the same unit tests in the biolib tree - but that is a hassle > > too. For BioRuby I propose (3). Maybe I ought to solely use the biolib > > tree for these specific unit tests and have a 'stub' in the Bioruby > > tree for them. 
> > > > This problem will come back - and keep in mind the free github space > > is 'only' 100 Mb. > > Create a piece of code to generate fake data for local test? > > -- > Ra From raoul.bonnal at itb.cnr.it Fri Sep 19 11:19:07 2008 From: raoul.bonnal at itb.cnr.it (Raoul Jean Pierre Bonnal) Date: Fri, 19 Sep 2008 17:19:07 +0200 Subject: [BioRuby] RFC Unit testing large files In-Reply-To: <20080919140514.GA32740@thebird.nl> References: <20080919140514.GA32740@thebird.nl> Message-ID: <1221837547.6231.5.camel@454-2> Il giorno ven, 19/09/2008 alle 16.05 +0200, Pjotr Prins ha scritto: > For microarray unit tests I have some 30Mb of files. Probably not > very nice to put those in the source tree. The options are: > > 1. Host them in the source tree - huge downloads for everyone. > > 2. Fetch them on demand by the unit tests - takes long time the first > time and where do I put them? In a cache directory? > > 3. Have the unit tests in a separate tree - special purpose testing > > 4. No unit tests for these > > I have the same unit tests in the biolib tree - but that is a hassle > too. For BioRuby I propose (3). Maybe I ought to solely use the biolib > tree for these specific unit tests and have a 'stub' in the Bioruby > tree for them. > > This problem will come back - and keep in mind the free github space > is 'only' 100 Mb. Create a piece of code to generate fake data for local test? 
-- Ra From pjotr2008 at thebird.nl Tue Sep 23 07:58:52 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Tue, 23 Sep 2008 13:58:52 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080918063237.GA17631@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> Message-ID: <20080923115852.GA6808@thebird.nl> Hi Naohisa, I fixed the Cache to be secure. It will use a safe Tmpdir if no directory is specified and raise SecurityErrors when appropriate. See http://github.com/pjotrp/bioruby/tree/master Pj. On Thu, Sep 18, 2008 at 08:32:37AM +0200, Pjotr Prins wrote: > Hi Naohisa, > > On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > If you don't want to implement any access control, > > using world writable directory like /tmp (comes from > > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, > > because this is vulnerable to a symbolic link attack. > > > > About symbolic link attack, please refer documents: > > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx > > (Note that Ruby's standard TempFile has no problem.) > > I agree - assuming you are running a webservice for microarrays. > > > When the "cache" directory isn't explicitly specified > > by user by using the environment variable BIORUBY_CACHE > > (or command-line options of custom application), > > doing without cache should be the default. > > NCBI won't be happy with that. But if that is what Bioruby wants... > It is not only about my own bandwidth ;-). > > > It is also good to raise SecurityError when the specified > > directory is writable by everyone. 
>
> I'll remove tmpdir - I introduced it because of an earlier mail.
>
> Disabling the cache is easy - off course. Another option is to use
> TmpFiles and keep track of those in a Hash (I'd rather not have large
> IO objects in memory). OK, that is what I'll implement - assuming you
> want to include the microarray stuff in Bioruby.
>
> Pj.
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 03:52:45 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 24 Sep 2008 16:52:45 +0900
Subject: [BioRuby] Bio::Blast::RPSBlast::Report
In-Reply-To: <48D111D2.8010404@broad.mit.edu>
References: <48D00A33.2050906@broad.mit.edu>
  <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp>
  <48D111D2.8010404@broad.mit.edu>
Message-ID: <20080924075246.24F291CBC49F@idnmail.gen-info.osaka-u.ac.jp>

The Bio::Blast::RPSBlast was introduced in April 2008, but
bioruby 1.2.1, the current latest release version, was released
in December 2007. This means you need the unreleased development
version of bioruby on GitHub.

You can download a snapshot as a tarball
http://github.com/bioruby/bioruby/tarball/master
and install it (or extract it and set the -I option or the RUBYLIB
environment variable etc.)

An alternative way is to use git (see
http://github.com/bioruby/bioruby/wikis ).

As it is a development version, it is unstable, something
may frequently not work, and incompatible changes may be made.
Please upgrade to the new version as soon as it is released.

In addition, after commit 11f1787cf93c046c06d4a33a554210d56866274e,
the limitation of multi-fasta report is eliminated when
used with Bio::FlatFile.

  require 'bio'

  filename = 'test.rpsblast'
  Bio::FlatFile.open(Bio::Blast::RPSBlast::Report, filename) do |ff|
    i = 0
    ff.each do |e|
      i += 1
      print "Query\##{i} = ", e.query_def, "\n"
      j = 0
      e.each do |hit|
        j += 1
        print "Query\##{i}/Hit\##{j} = ", hit.target_def, "\n"
        k = 0
        hit.each do |hsp|
          k += 1
          print "Query\##{i}/Hit\##{j}/Hsp\##{k} = ",
            "evalue=#{hsp.evalue}, ",
            "Positions #{hsp.query_from}..#{hsp.query_to}:",
            "#{hsp.hit_from}..#{hsp.hit_to}\n"
          print "Query : #{hsp.qseq}\n"
          print "        #{hsp.midline}\n"
          print "Hit   : #{hsp.hseq}\n"
        end
      end
    end
  end

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 17 Sep 2008 10:18:58 -0400
Sharvari Gujja wrote:

> Hi,
>
> Thank you so much for the info. However, on running the code for
> rpsblast output parser, I get the following error:
>
> *uninitialized constant Bio::Blast::RPSBlast (NameError)*
>
> I am not sure what exactly I am missing here.
>
> I really appreciate all the help.
>
> Thanks
> S
>
> Naohisa GOTO wrote:
> > On Tue, 16 Sep 2008 15:34:11 -0400
> > Sharvari Gujja wrote:
> >
> >
> >> Hi,
> >>
> >> Can someone please direct me to Bio::Blast::RPSBlast::Report
> >> documentation/examples ?
> >>
> >
> > http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html
> >
> > Note that the Bio::Blast::RPSBlast::Report exists still
> > only in development version, and the spec and usage
> > would be changed before the release version in near future.
> > > > Naohisa Goto > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 09:38:19 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 24 Sep 2008 22:38:19 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080923115852.GA6808@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> Message-ID: <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, I've seen files in your lib/bio/db/microarray, and I suppose it's still under development and it will be changed frequently, and I think it's not a time to include them in main bioruby. So, my comments below are mainly for future improvements. 1. about cache.rb The "safe = true" argument in 'set' and 'directory' seems bad idea. I think there is no need to give insecure options to users. In 'directory' method, > cache = Dir.mktmpdir(subdir) The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, and not available in 1.8.6 and older versions. Because most users are still using Ruby 1.8.5 and 1.8.6, to avoid using Dir.mktmpdir is currently a choice. Alternatively, write a document that the feature can work only in Ruby 1.8.7 or later. Note that current requirement of BioRuby is "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". Also note that FileUtils.remove_entry_secure was introduced in Ruby 1.8.3. 

Finally, I'm wondering whether the Cache class can remain
a singleton in the future. Currently, only NCBI_GEO
uses the cache, but if it were used by many classes
with different data formats, files in different formats
would exist in the same cache directory, and file name
conflicts might happen.

2. About file locations

The following are recommended to be moved to bio/io/,
because their main purpose is file or network I/O,
not data parsing.
  bio/db/microarray/cache.rb
  Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb
The class/module names do not need to be changed.

The files with an external dependency on "biolib" might
also be better moved from bio/db to another
location, but I have not found a good one.

3. Bio::Microarray::NCBI_GEO

In bio/db/microarray/ncbi_geo/geo.rb,

>  include REXML

If the only aim of including the REXML module is to skip the
REXML:: prefix, I would rather not include it in a library,
because the constants and methods defined in REXML get
mixed in and might cause bad side effects.
(Note that, unlike in a library, an application is free to
include anything.)

>    def XML::create(acc)

My impression is that the method name "XML.create" should be
reserved for a method that creates an XML data structure
from scratch or from some data.

To define a class method, I prefer 'def self.create(acc)'
because it makes it easy to change the class (module) name.

>    def XML::fetch(xmlfn, acc)
>      url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"

URI escaping is needed, e.g. acc=#{URI.escape(acc)}

>      print "Fetching ",url,"\n" if $VERBOSE
>      r = Net::HTTP.get_response( URI.parse( url ) )

To support proxies, use Bio::Command.get_uri(url).

>    def XML::valid_accession?(acc = nil)
>      acc = @acc if not acc
>      acc =~ /^(GSM|GSE|GPL)\d+$/

If "GSM0123\nGSM4567" is invalid, the regular expression
should be /\A(GSM|GSE|GPL)\d+\z/ .

>    def XML::parsexml(acc)

Is there no way to give the input XML data as a String?
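
On the regular-expression point above, the difference between the two patterns can be checked in plain Ruby. This is a small stand-alone sketch; the constant names are made up for illustration and do not appear in geo.rb.

```ruby
# ^ and $ anchor at line boundaries; \A and \z anchor at the start and
# end of the whole string. Constant names here are hypothetical.
LINE_ANCHORED   = /^(GSM|GSE|GPL)\d+$/
STRING_ANCHORED = /\A(GSM|GSE|GPL)\d+\z/

multi = "GSM0123\nGSM4567"

p(multi =~ LINE_ANCHORED)      # 0   (the first line alone matches)
p(multi =~ STRING_ANCHORED)    # nil (the string as a whole is rejected)
p("GSE1" =~ STRING_ANCHORED)   # 0   (a single accession still matches)
```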
> if XML::valid_accession? acc > cache = Cache.instance.directory > fn = cache+'/'+acc+'.xml' Please use File.join. Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 23 Sep 2008 13:58:52 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > I fixed the Cache to be secure. It will use a safe Tmpdir if no > directory is specified and raise SecurityErrors when appropriate. > > See http://github.com/pjotrp/bioruby/tree/master > > Pj. > > On Thu, Sep 18, 2008 at 08:32:37AM +0200, Pjotr Prins wrote: > > Hi Naohisa, > > > > On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote: > > > Hi Pjotr, > > > > > > If you don't want to implement any access control, > > > using world writable directory like /tmp (comes from > > > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, > > > because this is vulnerable to a symbolic link attack. > > > > > > About symbolic link attack, please refer documents: > > > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx > > > (Note that Ruby's standard TempFile has no problem.) > > > > I agree - assuming you are running a webservice for microarrays. > > > > > When the "cache" directory isn't explicitly specified > > > by user by using the environment variable BIORUBY_CACHE > > > (or command-line options of custom application), > > > doing without cache should be the default. > > > > NCBI won't be happy with that. But if that is what Bioruby wants... > > It is not only about my own bandwidth ;-). > > > > > It is also good to raise SecurityError when the specified > > > directory is writable by everyone. > > > > I'll remove tmpdir - I introduced it because of an earlier mail. > > > > Disabling the cache is easy - off course. Another option is to use > > TmpFiles and keep track of those in a Hash (I'd rather not have large > > IO objects in memory). OK, that is what I'll implement - assuming you > > want to include the microarray stuff in Bioruby. > > > > Pj. 
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 10:05:26 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 24 Sep 2008 23:05:26 +0900
Subject: [BioRuby] GFF attributes
In-Reply-To: <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp>
References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080924140526.E779C1CBC3C3@idnmail.gen-info.osaka-u.ac.jp>

Hi,

In my github repository, I've made incompatible changes in
the Bio::GFF::GFF2 and Bio::GFF::GFF3 classes.

Now, attributes are stored as an Array containing
[ tag, value ] pairs, for example,
[ [ 'Gene', 'CEN1' ], [ 'E_value', '0.0003' ],
  [ 'Note', 'CEN1; Chromosome I Centromere' ] ].

To get an attribute, it is recommended to use the new
method Record#attribute(tag) and related methods.

String escaping in free text is processed automatically.
In addition, a GFF2 attribute value with multiple tokens,
e.g. 'Target "HBA_HUMAN" 11 55', is parsed into a
Bio::GFF::GFF2::Record::Value object.
(Note that a value with a single token is still a String.)

To keep backward compatibility, the specification of
Bio::GFF is left largely unchanged apart from bug fixes. To use
the new features, Bio::GFF::GFF2 or Bio::GFF::GFF3 should be
used explicitly.

For more details, please see
http://github.com/ngoto/bioruby/commit/95391949d217e6f7c9ee7444afebec6ee8677035

If no problems are found, it will be included in the main
bioruby repository.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 17 Sep 2008 12:56:19 +0900
Naohisa GOTO wrote:

> Hi,
>
> On Thu, 11 Sep 2008 11:34:36 +0900
> Tomoaki NISHIYAMA wrote:
>
> > Hi
> >
> > > To prevent repeating the bug, I want to use the GFF string
> > > described in your mail for the test script in BioRuby.
> > > (test/unit/bio/db/test_gff.rb) > > > Can you give permission? > > > > Surely, I have no objection. > > The string is one of the line in the Popular genome annotation from > > the JGI site. > > ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/ > > Poptr1_1.JamboreeModels.gff.gz > > So, I think acknowledging them is a good idea. > > Thank you. I'll add above URL in the comments of the test. > > > For test string, I think another pattern including multiple value for > > one key is worth to add. > > The example from http://www.sanger.ac.uk/Software/formats/GFF/ > > GFF_Spec.shtml: > > seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 > > 55 ; E_value 0.0003 > > > > Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the > > value for 'Target'. > > But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more > > sensible, or represent > > more of the meaning of the specification. > > In this case, string escaping and quotation in free text > can also be processed by the class, and > [ 'HBA_HUMAN', 11', '55'] can be returned. > > > Since changing this return value will make incompatibilities, I'm not > > sure > > whether it can be changed. > > But if it is ever to be changed, it is better changed early, or > > stated as such. > > If it is too late, perhaps we can make a method under a different > > name so that > > currently working code will not be affected. > > Indeed, for GFF2 attributes, I've alrealy found a > design problem in current Bio::GFF::GFF2#attributes. > Currently, a hash is used to store attributes, but > the GFF2 spec allows more than two tags with the same name. > > For example, > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature > Align 101 11 ; Align 179 36 ; > > In this case, with current bioruby implementation, the > "Align 101 11" is overwritten by the latter "Align 179 36", > and we can only get { "Align" => "179 36" }. 
>
> To solve the problem, I can think of the following two ways.
>
> 1. Using an Array to store values from multiple tags.
>
> For example, in the above case,
>   @attributes = {}
>   @attributes['Align'] = [ '101 11', '179 36' ]
>   @attributes['Target'] = '"HBA_HUMAN" 11 54'
>
> I already took this approach in GFF3 with incompatible
> changes, because the previous implementation of
> GFF3#attributes was broken and cannot be used.
> But now, I just think this approach is not good and
> I want to change it now, because checking whether
> the value is an array or not is needed every time.
>
> In addition, in this case, we can not parse
> '"HBA_HUMAN" 11 54' to [ 'HBA_HUMAN', '11', '54' ],
> because it is impossible to distinguish values from
> multiple tags or parsed values, unless an array is
> always used.
>
> 2. Giving up using hash, and using an array (or possibly
> a new class e.g. GFF2::Attributes) of [ tag, value ]
> pairs.
>
> For backward compatibility, hash can be dynamically
> generated when old behavior is requested.
>
> I think this approach is better.
> I'll implement this later.
>
> Any comments and suggestions are welcome.
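
The trade-off between the two approaches quoted above can be illustrated in plain Ruby. This is a sketch only: it uses no BioRuby classes, and the helper names first_value and all_values are hypothetical stand-ins, not the methods of Bio::GFF::GFF2::Record.

```ruby
# Plain-Ruby illustration; helper names are hypothetical.
pairs = [
  ['Align',  '101 11'],
  ['Align',  '179 36'],
  ['Target', '"HBA_HUMAN" 11 55']
]

# Storing the same data in a Hash keeps only the last value per tag.
as_hash = {}
pairs.each { |tag, value| as_hash[tag] = value }

# With an Array of [tag, value] pairs, repeated tags are preserved.
def first_value(pairs, tag)
  found = pairs.assoc(tag)        # first pair whose tag matches, or nil
  found ? found[1] : nil
end

def all_values(pairs, tag)
  pairs.select { |t, _| t == tag }.map { |_, v| v }
end

p as_hash['Align']                # "179 36"
p first_value(pairs, 'Align')     # "101 11"
p all_values(pairs, 'Align')      # ["101 11", "179 36"]
```

A Hash for backward compatibility can still be derived from the pairs on demand, as the loop above does, at the cost of losing repeated tags.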
> > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr2008 at thebird.nl Wed Sep 24 12:29:24 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 24 Sep 2008 18:29:24 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080924162924.GA19778@thebird.nl> Hi Naohisa, On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote: > Hi Pjotr, > > I've seen files in your lib/bio/db/microarray, and I suppose > it's still under development and it will be changed frequently, > and I think it's not a time to include them in main bioruby. > So, my comments below are mainly for future improvements. What there is is 'stable'. Certainly the NCBI stuff is rather complete. The biolib libraries could go in later. It is up to you, but I think it would be nice to have mainstream microarray support before one of the other Bio* libraries (and biolib support is there for all). We don't want to be beaten by BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem available - but that may be confusing for others. Another thing, what is the point of open source software if no one tests it. How about regularly releasing a testing version of bioruby? We see some more activity in BioRuby - which is a good thing. 
You can't expect things to be ready from the word GO! Meanwhile, I do appreciate your comments. It is forcing me to write better code. Teaching an old fox new tricks ;-) > 1. about cache.rb > > The "safe = true" argument in 'set' and 'directory' seems > bad idea. I think there is no need to give insecure options > to users. I'll remove it if you wish. I think it is up to the implementor - if you have a web service you better use the default safe mode. Otherwise, who cares. I, for one, would like to use /tmp in some cases. > In 'directory' method, > > cache = Dir.mktmpdir(subdir) > > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, > and not available in 1.8.6 and older versions. > Because most users are still using Ruby 1.8.5 and 1.8.6, > to avoid using Dir.mktmpdir is currently a choice. > Alternatively, write a document that the feature can work > only in Ruby 1.8.7 or later. Yes we can document that. Using microarray bindings a later Ruby is a good idea anyway. > Note that current requirement of BioRuby is > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". > Also note that FileUtils.remove_entry_secure was introduced > in Ruby 1.8.3. Well, the modules are optionally included. It shouldn't break if people don't use the microarray stuff. This is true for the dependency on external biolib too. > Finally, I'm wondering if the Cache class can still be > a singleton or not in the future. Currently, only NCBI_GEO > is using the cache, but if it were used from many classes > with different data formats, files with different formats > would be existed in the same cache directory, and file name > conflicts might be happened. This implementation is such that we create a shared dir, with classes using different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between modules. My current GEO cache is 30 Mb. If I were to download that every time my research would be severely hampered. 
I think it is very useful and could also be for running webservices of other modules. You don't want web servers to retain everything in memory. > 2. About file locations > > Below are recommended to be moved to bio/io/, > because their main purpose is file or network I/O, > and not data parsing. > bio/db/microarray/cache.rb OK. > Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb It does NCBI XML parsing - but that is not what you mean? > The class/module names are not needed to be changed. > > The files with external dependency to the "biolib" might > also be suggested to be moved from bio/db to the other > location, but no best location found. heh - anyone else a suggestiong? The biolib stuff does do microarray loading and will do normalization and analysis soon. > 3. BIo::Microarray::NCBI_GEO > > In bio/db/microarray/ncbi_geo/geo.rb, > > > include REXML > > If the aim to include REXML module is only to skip the > REXML:: prefix, I don't like to include it in library, > because the constants and methods defined in REXML are > mixed and they might cause bad side effects. > (Note that unlike in a library, it is free to include > anything in an application.) OK > > def XML::create(acc) > > In my impression, the method name "XML.create" might be > reserved to be used by a method to create XML data structure > from scratch or from some data. > To define a class method, I like 'def self.create(acc)' > because it is easy to change class (module) name. It is a class factory. I'll have a think. > > def XML::fetch(xmlfn, acc) > > url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml" > > URI escaping is needed, e.g. acc=#{URI.escape(acc)} > > > print "Fetching ",url,"\n" if $VERBOSE > > r = Net::HTTP.get_response( URI.parse( url ) ) > > To support proxy, use Bio::Command.get_uri(url). 
OK and OK > > def XML::valid_accession?(acc = nil) > > acc = @acc if not acc > > acc =~ /^(GSM|GSE|GPL)\d+$/ > > If "GSM0123\nGSM4567" is invalid, the regular expression > should be /\A(GSM|GSE|GPL)\d+\z/ . good point. > > def XML::parsexml(acc) > > Is there no way to get input XML data as String? Sigh. Sure there is. Of from a file. An IO object would be cool. Maybe the next version. > > if XML::valid_accession? acc > > cache = Cache.instance.directory > > fn = cache+'/'+acc+'.xml' > > Please use File.join. Sorry. OK. Pj. From davide.rambaldi at ifom-ieo-campus.it Thu Sep 25 03:35:54 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Thu, 25 Sep 2008 09:35:54 +0200 Subject: [BioRuby] test and bioruby shell questions In-Reply-To: <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp> References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp> Message-ID: <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it> On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote: > The demo above was designed to utilize the KEGG API, which is a > SOAP based web service, > so we need to change the default data source to obtain this entry. > We can fix this by switching to use NCBI's efetch method instead. I manage to write a fix for this... 
is really horrible actually (but it works) I have inserted my code in the nested if/else that retrieve the entry, so after the KEGG API try, the shell try NCBI::REST.efetch Oni:~/src/bioruby tucano$ git diff lib/bio/shell/plugin/entry.rb diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/ entry.rb index 6d36fb5..0a45ecd 100644 --- a/lib/bio/shell/plugin/entry.rb +++ b/lib/bio/shell/plugin/entry.rb @@ -88,8 +88,16 @@ module Bio::Shell # KEGG API at http://www.genome.jp/kegg/soap/ else - puts "Retrieving entry from KEGG API (#{arg})" entry = bget(arg) + if $?.exitstatus == 0 and str.length != 0 + puts "Retrieving entry from KEGG API (#{arg})" + else + # efetch from NCBI + puts "Retrieving entry from NCBI (#{arg})" + require 'bio/io/ncbirest.rb' + fetch = Bio::NCBI::REST.efetch("AF237819", {"db"=>"nuccore", "rettype"=>"gb"}) + entry = fetch.to_s + end end end So the questions/comments: 1. I have added the require 'bio/io/ncbirest.rb' beacuse in bio.rb ncbirest.rb is not loaded (only SOAP). Is a bug or a feature? 2. the standard demo command now is able to retrieve the genbank entry, but generate an error in the MIDI file generation: bioruby> midifile("data/AF237819.mid", kuma.naseq) Saving MIDI file (data/AF237819.mid) ... Error: Failed to save (data/ AF237819.mid) : No such file or directory - data/AF237819.mid any clue for this FAil? by the way: wow a module to translate a sequence in music? I really wont to test it also! I have made a software that do something similar: http://recipient.cc/playgene/ Is made with a perl (?!) script to efetch sequence from NCBI, and a flash application for the interface and to load the music library... :-) Best Regards Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Thu Sep 25 10:58:17 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 25 Sep 2008 23:58:17 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080924162924.GA19778@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> <20080924162924.GA19778@thebird.nl> Message-ID: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Hi, On Wed, 24 Sep 2008 18:29:24 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > I've seen files in your lib/bio/db/microarray, and I suppose > > it's still under development and it will be changed frequently, > > and I think it's not a time to include them in main bioruby. > > So, my comments below are mainly for future improvements. > > What there is is 'stable'. Certainly the NCBI stuff is rather complete. The > biolib libraries could go in later. It is up to you, but I think it would be > nice to have mainstream microarray support before one of the other Bio* > libraries (and biolib support is there for all). 
We don't want to be beaten by > BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem > available - but that may be confusing for others. I agree it is good to have microarray support, if it is useful. Could you please show short examples and use cases of the microarray support? > Another thing, what is the point of open source software if no one tests it. > How about regularly releasing a testing version of bioruby? We see some more > activity in BioRuby - which is a good thing. You can't expect things to be > ready from the word GO! I think new version should be released soon, but currently, there is no release management. > Meanwhile, I do appreciate your comments. It is forcing me to write better > code. Teaching an old fox new tricks ;-) > > > 1. about cache.rb > > > > The "safe = true" argument in 'set' and 'directory' seems > > bad idea. I think there is no need to give insecure options > > to users. > > I'll remove it if you wish. I think it is up to the implementor - if you have a > web service you better use the default safe mode. Otherwise, who cares. I, for > one, would like to use /tmp in some cases. I wish it is to be removed. Recently, temporary file vulnerability in software not directly related to server services have also been treated as security issue, e.g. f2c (fortran to C converter) http://www.debian.org/security/2005/dsa-661 So, it's good not to give a chance of insecure operation. > > In 'directory' method, > > > cache = Dir.mktmpdir(subdir) > > > > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, > > and not available in 1.8.6 and older versions. > > Because most users are still using Ruby 1.8.5 and 1.8.6, > > to avoid using Dir.mktmpdir is currently a choice. > > Alternatively, write a document that the feature can work > > only in Ruby 1.8.7 or later. > > Yes we can document that. Using microarray bindings a later Ruby is a > good idea anyway. OK. 
Question: Does the microarray support work on Ruby 1.9? Most part of bioruby still do not support Ruby 1.9, though some code can run on Ruby 1.9. > > Note that current requirement of BioRuby is > > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". > > Also note that FileUtils.remove_entry_secure was introduced > > in Ruby 1.8.3. > > Well, the modules are optionally included. It shouldn't break if > people don't use the microarray stuff. This is true for the dependency > on external biolib too. OK. > > Finally, I'm wondering if the Cache class can still be > > a singleton or not in the future. Currently, only NCBI_GEO > > is using the cache, but if it were used from many classes > > with different data formats, files with different formats > > would be existed in the same cache directory, and file name > > conflicts might be happened. > > This implementation is such that we create a shared dir, with classes using > different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between > modules. My current GEO cache is 30 Mb. If I were to download that every time > my research would be severely hampered. I think it is very useful and could > also be for running webservices of other modules. You don't want web servers > to retain everything in memory. In the current implementation, the singleton object stores @subdir, and it is the same as a global variable. For example, If a user want to get both GEO and ArrayExpress (hopefully supported in the future), and I wrote a code like this: Bio::Microarray::Cache.set('/home/who/.bioruby-cache') obj1 = Bio::Microarray::GEO::GSE.new('GSE1') obj2 = Bio::Microarray::ArrayExpress.new('Acc2') obj3 = Bio::Microarray::GEO::GSE.new('GSE3') obj4 = Bio::Microarray::ArrayExpress.new('Acc4') In this case, how to specify sub directory? Or, am I misunderstanding @subdir? BTW, FYI, there is memcached, on-memory cache for web server. http://www.danga.com/memcached/ > > 2. 
About file locations > > > > Below are recommended to be moved to bio/io/, > > because their main purpose is file or network I/O, > > and not data parsing. > > bio/db/microarray/cache.rb > > OK. > > > Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb > > It does NCBI XML parsing - but that is not what you mean? I meant only XML.create, XML.fetch, and XML.parsexml methods. But, because they are short, I think again that no need to move them. For microarray data, or for large-scale data, because of efficiency, I can understand that close relationship between I/O and data format class is needed. However, from the viewpoint to treat various data from various databases, separating I/O and data parsing is better, maybe in the future. > > The class/module names are not needed to be changed. > > > > The files with external dependency to the "biolib" might > > also be suggested to be moved from bio/db to the other > > location, but no best location found. > > heh - anyone else a suggestiong? The biolib stuff does do microarray loading > and will do normalization and analysis soon. > > > 3. BIo::Microarray::NCBI_GEO > > > > In bio/db/microarray/ncbi_geo/geo.rb, > > > > > include REXML > > > > If the aim to include REXML module is only to skip the > > REXML:: prefix, I don't like to include it in library, > > because the constants and methods defined in REXML are > > mixed and they might cause bad side effects. > > (Note that unlike in a library, it is free to include > > anything in an application.) > > OK > > > > def XML::create(acc) > > > > In my impression, the method name "XML.create" might be > > reserved to be used by a method to create XML data structure > > from scratch or from some data. > > > To define a class method, I like 'def self.create(acc)' > > because it is easy to change class (module) name. > > It is a class factory. I'll have a think. I suggest Bio::Microarray::GEO::XML.new(acc). 
> > > def XML::fetch(xmlfn, acc) > > > url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml" > > > > URI escaping is needed, e.g. acc=#{URI.escape(acc)} > > > > > print "Fetching ",url,"\n" if $VERBOSE > > > r = Net::HTTP.get_response( URI.parse( url ) ) > > > > To support proxy, use Bio::Command.get_uri(url). > > OK and OK > > > > def XML::valid_accession?(acc = nil) > > > acc = @acc if not acc > > > acc =~ /^(GSM|GSE|GPL)\d+$/ > > > > If "GSM0123\nGSM4567" is invalid, the regular expression > > should be /\A(GSM|GSE|GPL)\d+\z/ . > > good point. > > > > def XML::parsexml(acc) > > > > Is there no way to get input XML data as String? > > Sigh. Sure there is. Of from a file. An IO object would be cool. > Maybe the next version. > > > > if XML::valid_accession? acc > > > cache = Cache.instance.directory > > > fn = cache+'/'+acc+'.xml' > > > > Please use File.join. > > Sorry. OK. > > Pj. > Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Thu Sep 25 12:18:39 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Thu, 25 Sep 2008 18:18:39 +0200 Subject: [BioRuby] bioruby shell Message-ID: <32C805D6-081B-4616-BBE2-26645CCC8146@ifom-ieo-campus.it> Hello, I have posted as a reply in another thread a small modification to lib/bio/shell/plugin/entry.rb that resolve the demo problem (die after trying to download a genbank from KeggAPI): diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/ entry.rb index 6d36fb5..0a45ecd 100644 --- a/lib/bio/shell/plugin/entry.rb +++ b/lib/bio/shell/plugin/entry.rb @@ -88,8 +88,16 @@ module Bio::Shell # KEGG API at http://www.genome.jp/kegg/soap/ else - puts "Retrieving entry from KEGG API (#{arg})" entry = bget(arg) + if $?.exitstatus == 0 and str.length != 0 + puts "Retrieving entry from KEGG API (#{arg})" + else + # efetch from NCBI + puts "Retrieving entry from NCBI (#{arg})" + require 
'bio/io/ncbirest.rb' + fetch = Bio::NCBI::REST.efetch("AF237819", {"db"=>"nuccore", "rettype"=>"gb"}) + entry = fetch.to_s + end end end I have some other ideas for the shell: - adding a method to remove all saved objects - making an help (at least for demo, ls and rm commands) - adding an OptionParser In general I want to propose some other simple modification to this part of the bioruby library. I am losing my time? there is another person on this? or I can go on? Many thanks for feedback P.S: my simple BLAT application blatanalyzer is now accessible via svn at svn checkout svn://rubyforge.org/var/svn/blatanalyzer/trunk any feedback is really appreciated thanks again Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Fri Sep 26 09:37:33 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Fri, 26 Sep 2008 22:37:33 +0900 Subject: [BioRuby] test and bioruby shell questions In-Reply-To: <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it> References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp> <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it> Message-ID: <20080926133733.A99061CBC3F0@idnmail.gen-info.osaka-u.ac.jp> Hi, On Thu, 25 Sep 2008 09:35:54 +0200 Davide Rambaldi wrote: > On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote: > > > The demo above was designed to utilize the KEGG API, which is a > > SOAP based web service, > > so we need to change the default data source to obtain this entry. 
> > We can fix this by switching to use NCBI's efetch method instead. > > > > I manage to write a fix for this... is really horrible actually (but > it works) > I have inserted my code in the nested if/else that retrieve the > entry, so after the KEGG API try, the shell try NCBI::REST.efetch > > Oni:~/src/bioruby tucano$ git diff lib/bio/shell/plugin/entry.rb > diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/ > entry.rb > index 6d36fb5..0a45ecd 100644 > --- a/lib/bio/shell/plugin/entry.rb > +++ b/lib/bio/shell/plugin/entry.rb > @@ -88,8 +88,16 @@ module Bio::Shell > > # KEGG API at http://www.genome.jp/kegg/soap/ > else > - puts "Retrieving entry from KEGG API (#{arg})" > entry = bget(arg) > + if $?.exitstatus == 0 and str.length != 0 > + puts "Retrieving entry from KEGG API (#{arg})" > + else > + # efetch from NCBI > + puts "Retrieving entry from NCBI (#{arg})" > + require 'bio/io/ncbirest.rb' > + fetch = Bio::NCBI::REST.efetch("AF237819", > {"db"=>"nuccore", "rettype"=>"gb"}) > + entry = fetch.to_s > + end > end > end Thank you for a patch, but it has some problems: For KEGG API, $?.exitstatus has no mean, and no need to check $?. The "AF237819" should not be hardcoded because the method is not only for demo, but a bioruby-shell command to fetch entry specified by a user. Also note that "db" => "nuccore" would not always be good. (If result is empty, switching to another database and trying again would be the best way.) > So the questions/comments: > > 1. I have added the require 'bio/io/ncbirest.rb' beacuse in bio.rb > ncbirest.rb is not loaded (only SOAP). Is a bug or a feature? This is a bug, and it will soon be fixed. > 2. the standard demo command now is able to retrieve the genbank > entry, but generate an error in the MIDI file generation: > > bioruby> midifile("data/AF237819.mid", kuma.naseq) > Saving MIDI file (data/AF237819.mid) ... 
Error: Failed to save (data/ > AF237819.mid) : No such file or directory - data/AF237819.mid > > any clue for this FAil? The error may be caused because directory named "data" did not exist, and the program cannot save the file. To solve this, simply do "mkdir data". > by the way: wow a module to translate a sequence in music? I really > wont to test it also! I have made a software that do something similar: > > http://recipient.cc/playgene/ > > Is made with a perl (?!) script to efetch sequence from NCBI, and a > flash application for the interface and to load the music library... :-) Yes, you can enjoy music. Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr2008 at thebird.nl Mon Sep 29 08:34:11 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Mon, 29 Sep 2008 14:34:11 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> References: <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> <20080924162924.GA19778@thebird.nl> <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080929123411.GA31668@thebird.nl> Hi Naohisa, On Thu, Sep 25, 2008 at 11:58:17PM +0900, Naohisa GOTO wrote: > I agree it is good to have microarray support, if it is useful. > Could you please show short examples and use cases of the > microarray support? You mean, like load file, read probe? There are unit tests for that in BioLib. I'll expand on the Tutorial once this goes into BioRuby. > Question: Does the microarray support work on Ruby 1.9? > Most part of bioruby still do not support Ruby 1.9, > though some code can run on Ruby 1.9. 
I will test my sources with 1.9. Should be no problem - no legacy
stuff in there.

> In the current implementation, the singleton object stores
> @subdir, and it is the same as a global variable.
> For example, if a user wants to get both GEO and ArrayExpress
> (hopefully supported in the future), and I wrote code
> like this:
>
> Bio::Microarray::Cache.set('/home/who/.bioruby-cache')
> obj1 = Bio::Microarray::GEO::GSE.new('GSE1')
> obj2 = Bio::Microarray::ArrayExpress.new('Acc2')
> obj3 = Bio::Microarray::GEO::GSE.new('GSE3')
> obj4 = Bio::Microarray::ArrayExpress.new('Acc4')
>
> In this case, how to specify sub directory?
> Or, am I misunderstanding @subdir?

Well, hey! You are making life a little difficult for me here. In an
earlier mail you wrote:

> Note that some classes use Tempfile class, a standard bundled
> class with Ruby by default, and the Tempfile class depends
> on environment variables (TMPDIR, TMP, etc.).

So I introduced tmpdir - which I had to remove later. Also you wrote:

> I think cache isn't suitable for standard, because its purpose
> may differ from program (or class, module, etc.) to program.

So I introduced a cache specific to the GEO module. This Cache
definition is for GEO and used as such. There are no conflicts with
other modules now - as there are none. Loading on demand is not a
solution - as that would be unusable.

The upside of a Singleton is that a cache gets defined once - and is
not part of the normal interfaces. Modules can define their own
subdirectories in the Cache. That would be OK.

Let's not take this further until someone wants to build on this
cache. It is not my itch to scratch. Like you wrote earlier, a cache
implementation is non-trivial. Right. I wasn't intending to do that.
The cache we have now is safe and sufficient for this module.

I'll stick in a warning not to use the cache for other purposes. OK?

> > It is a class factory. I'll have a think.
>
> I suggest Bio::Microarray::GEO::XML.new(acc).

Not sure about that.
The definition of 'new' is tied to initializing a class. Here we have
a factory method, and we need to distinguish the two. Code should
really document itself. I think my 'create' is actually fine for a
factory, but does anyone have another suggestion? These examples all
use 'create':

http://www.scribd.com/doc/396559/gof-patterns-in-ruby

Pj.

From ngoto at gen-info.osaka-u.ac.jp Mon Sep 29 16:26:39 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Tue, 30 Sep 2008 05:26:39 +0900
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080929123411.GA31668@thebird.nl>
References: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> <20080929123411.GA31668@thebird.nl>
Message-ID: <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Pjotr,

> Hi Naohisa,
>
> On Thu, Sep 25, 2008 at 11:58:17PM +0900, Naohisa GOTO wrote:
> > I agree it is good to have microarray support, if it is useful.
> > Could you please show short examples and use cases of the
> > microarray support?
>
> You mean, like load file, read probe? There are unit tests for that
> in BioLib. I'll expand on the Tutorial once this goes into BioRuby.

OK.

> > Question: Does the microarray support work on Ruby 1.9?
> > Most part of bioruby still do not support Ruby 1.9,
> > though some code can run on Ruby 1.9.
>
> I will test my sources with 1.9. Should be no problem - no legacy
> stuff in there.

For now, don't worry if it fails to run on Ruby 1.9.
We will migrate gradually to 1.9 after the release of Ruby 1.9.1,
not now.

> > In the current implementation, the singleton object stores
> > @subdir, and it is the same as a global variable.
> > For example, If a user want to get both GEO and ArrayExpress > > (hopefully supported in the future), and I wrote a code > > like this: > > > > Bio::Microarray::Cache.set('/home/who/.bioruby-cache') > > obj1 = Bio::Microarray::GEO::GSE.new('GSE1') > > obj2 = Bio::Microarray::ArrayExpress.new('Acc2') > > obj3 = Bio::Microarray::GEO::GSE.new('GSE3') > > obj4 = Bio::Microarray::ArrayExpress.new('Acc4') > > > > In this case, how to specify sub directory? > > Or, am I misunderstanding @subdir? > > Well, hey! You are making life a little difficult for me here. In an > earlier mail you wrote: > > > Note that some classes use Tempfile class, a standard bundled > > class with Ruby by default, and the Tempfile class depends > > on enviroment variables (TMPDIR, TMP, etc.). > > So I introduced tmpdir - which I had to remove later. Also you wrote: > > > I think cache isn't suitable for standard, because its purpose > > may differ from program (or class, module, etc.) to program. > > so I introduce a cache specific to the GEO module. This Cache > definition is for GEO and used as such. There are no conflicts with > other modules now - as there are none. Loading on demand is not a > solution - as that would be unusable. The name "Bio::Microarray::Cache" sounds as if this were common to all microarray classes. To make clear the Cache is only for GEO, please move the class under Bio::Microarray::GEO, i.e. the class name is changed from Bio::Microarray::Cache to Bio::Microarray::GEO::Cache. In addition, please move the file to bio/db/microarray/ncbi_geo/cache.rb (no need to move under bio/io because it is specific to GEO and not intended to be used with other classes/modules). > The upside of a Singleton is that a cache gets defined once - and is > not part of the normal interfaces. Modules can define their own > subdirectories in the Cache. That would be OK. > > Lets not take this further until someone wants to build on this > cache. It is not my itch to scratch. 
Like you wrote earlier, a cache > implementation is non-trivial. Right. I wasn't intending to do that. > The cache we have now is safe and sufficient for this module. > > I'll stick in a warning not to use the cache for other purposes. OK? OK. In BioRuby, there are already many classes/modules/methods with warning documents "users should not use it directly", "internal use only", etc. > > > It is a class factory. I'll have a think. > > > > I suggest Bio::Microarray::GEO::XML.new(acc). > > Not sure about that. The definition of 'new' is tied to initializing a > class. Here we have a factory method, we need to distinguish. Code > should really document itself. I think my 'create' is actually fine > for a factory, but if anyone has another suggestion? These examples > all use 'create': > > http://www.scribd.com/doc/396559/gof-patterns-in-ruby "create" will be used, if no good suggestion given. Though, maybe bioscientists don't know much about design patterns. -- Naohisa Goto From pjotr2008 at thebird.nl Mon Sep 29 16:35:19 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Mon, 29 Sep 2008 22:35:19 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp> References: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> <20080929123411.GA31668@thebird.nl> <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <20080929203519.GA5277@thebird.nl> On Tue, Sep 30, 2008 at 05:26:39AM +0900, Naohisa Goto wrote: > "create" will be used, if no good suggestion given. > Though, maybe bioscientists don't know much about design patterns. We oughta teach 'em ;-). But, yes. You are right. Pj. From donttrustben at gmail.com Mon Sep 29 21:55:35 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 30 Sep 2008 11:55:35 +1000 Subject: [BioRuby] Bioruby Website Problems Message-ID: Hi, I am having problems using the bioruby.org site. 
My first problem was that the fetch function started giving me 404s: >> pdb = Bio::Fetch.new.fetch('PDB','2A06') OpenURI::HTTPError: 404 Not Found from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http' from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open' from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop' from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch' from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop' from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri' from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri' from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch' from (irb):13 Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself redirects to the Human Genome Center (Tokyo Uni) front page. Thanks, ben -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. From donttrustben at gmail.com Mon Sep 29 22:40:39 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 30 Sep 2008 12:40:39 +1000 Subject: [BioRuby] biofetch confusion Message-ID: Hi, I was running a fetch from within ruby using the alternate server, and ran into a problem it took stupid me a little while to figure out. Thought I might post to help others. >> pdb = Bio::Fetch.new('www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06') NoMethodError: You have a nil object when you didn't expect it! 
The error occurred while evaluating nil.downcase
from /usr/lib/ruby/1.8/open-uri.rb:551:in `find_proxy'
from /usr/lib/ruby/1.8/open-uri.rb:147:in `open_loop'
from /usr/lib/ruby/1.8/open-uri.rb:164:in `call'
from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
from (irb):5

Same problem happens when you use the br_biofetch.rb script directly.

The problem was fixed by adding 'http://' to the front of the url:

>> pdb = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06')
... pdb printed here ...

Should bioruby add the http:// somehow?

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From ktym at hgc.jp Mon Sep 29 22:47:45 2008
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 30 Sep 2008 11:47:45 +0900
Subject: [BioRuby] biofetch confusion
In-Reply-To:
References:
Message-ID: <8A9B37C3-64AA-4CDA-AB32-E7C4155AECE7@hgc.jp>

Hi,

> Should bioruby add the http:// somehow?

I don't think so. Please add protocol prefix by yourself.

Toshiaki

On 2008/09/30, at 11:40, Ben Woodcroft wrote:

> Hi,
>
> I was running a fetch from within ruby using the alternate server, and ran
> into a problem it took stupid me a little while to figure out. Thought I
> might post to help others.
>
>>> pdb = Bio::Fetch.new('www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06')
> NoMethodError: You have a nil object when you didn't expect it!
> The error occurred while evaluating nil.downcase
> from /usr/lib/ruby/1.8/open-uri.rb:551:in `find_proxy'
> from /usr/lib/ruby/1.8/open-uri.rb:147:in `open_loop'
> from /usr/lib/ruby/1.8/open-uri.rb:164:in `call'
> from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
> from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
> from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
> from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
> from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
> from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
> from (irb):5
>
>
> Same problem happens when you use the br_biofetch.rb script directly.
>
>
> The problem was fixed by adding 'http://' to the front of the url:
>
>>> pdb = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06')
> ... pdb printed here ...
>
>
> Should bioruby add the http:// somehow?
>
> Thanks,
> ben
>
> --
> FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> place.
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ktym at hgc.jp Mon Sep 29 22:40:56 2008
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 30 Sep 2008 11:40:56 +0900
Subject: [BioRuby] Bioruby Website Problems
In-Reply-To:
References:
Message-ID: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp>

Hi,

Sorry for the inconvenience.

I forgot to take care of this, but it is due to our server replacement.
The bioruby.org services (including the BioFetch server) will be
unavailable until Oct 2nd.

In the meantime, I recommend using the TogoWS service hosted at

http://togows.dbcls.jp/entry/pdb/2A06

which we have developed over the past few months using BioRuby
functionality.

Regards,
Toshiaki Katayama

On 2008/09/30, at 10:55, Ben Woodcroft wrote:

> Hi,
>
> I am having problems using the bioruby.org site.
My first problem was that > the fetch function started giving me 404s: > >>> pdb = Bio::Fetch.new.fetch('PDB','2A06') > OpenURI::HTTPError: 404 Not Found > from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http' > from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open' > from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop' > from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch' > from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop' > from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri' > from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri' > from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch' > from (irb):13 > > Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself > redirects to the Human Genome Center (Tokyo Uni) front page. > > Thanks, > ben > > -- > FYI: My email addresses at unimelb, uq and gmail all redirect to the same > place. > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From donttrustben at gmail.com Mon Sep 29 23:47:46 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 30 Sep 2008 13:47:46 +1000 Subject: [BioRuby] Bioruby Website Problems In-Reply-To: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp> References: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp> Message-ID: Thanks for the quick reply. That togows site looks cool. 2008/9/30 Toshiaki Katayama > Hi, > > Sorry for any inconveniences. > > I forgot to care about this, but it is due to our server replacement. > The bioruby.org services (including BioFetch server) will be unavailable > until Oct 2nd. > > Meanwhile, I can recommend you to use TogoWS service hosted at > > http://togows.dbcls.jp/entry/pdb/2A06 > > which we have developed these months utilizing BioRuby functionality. 
>
> Regards,
> Toshiaki Katayama
>
> On 2008/09/30, at 10:55, Ben Woodcroft wrote:
>
> > Hi,
> >
> > I am having problems using the bioruby.org site. My first problem was that
> > the fetch function started giving me 404s:
> >
> >>> pdb = Bio::Fetch.new.fetch('PDB','2A06')
> > OpenURI::HTTPError: 404 Not Found
> > from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http'
> > from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
> > from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
> > from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
> > from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
> > from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
> > from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
> > from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
> > from (irb):13
> >
> > Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself
> > redirects to the Human Genome Center (Tokyo Uni) front page.
> >
> > Thanks,
> > ben
> >
> > --
> > FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> > place.
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From donttrustben at gmail.com Tue Sep 30 00:21:12 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 30 Sep 2008 14:21:12 +1000
Subject: [BioRuby] Bio::SPTR bug and fix
Message-ID:

Hi,

So I was trying to parse a uniprot file, and I found that bioruby threw an
error when I asked it to return a DR key that didn't exist in the uniprot
file (in particular, GO annotations when none were defined).
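The kind of failure described here can be illustrated with a small stand-alone sketch. This is not the Bio::SPTR code: the class DrTable, its two methods, and the sample row are hypothetical and exist only for illustration. It assumes the error comes from calling a method on nil when a key is absent, and shows one way to avoid that, namely falling back to an empty Array.

```ruby
# Hypothetical stand-in for a parsed UniProt entry; NOT Bio::SPTR itself.
class DrTable
  # records: Hash mapping a database name (e.g. "GO") to an Array of
  # cross-reference rows. The rows used below are illustrative only.
  def initialize(records)
    @records = records
  end

  # Unsafe lookup: assumes the key is always present, so an absent key
  # gives nil and the call to #map raises NoMethodError.
  def dr_unsafe(key)
    @records[key].map { |row| row.first }
  end

  # Safer lookup: fall back to an empty Array when the key is absent.
  def dr(key)
    (@records[key] || []).map { |row| row.first }
  end
end

table = DrTable.new("PDB" => [["2A06", "X-ray"]])
p table.dr("PDB")   # prints ["2A06"]
p table.dr("GO")    # prints []
```

Returning an empty Array keeps callers that iterate over the result working without a special case for missing annotations.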
I made a branch that fixes this by returning [] in that situation, and added a test for it as well: http://github.com/wwood/bioruby/tree/sptr_fix If this code is good enough then can I request it be merged into the tree? Thanks, ben -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. From ngoto at gen-info.osaka-u.ac.jp Tue Sep 30 05:05:44 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 30 Sep 2008 18:05:44 +0900 Subject: [BioRuby] Bio::SPTR bug and fix In-Reply-To: References: Message-ID: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp> Thank you. I modified your patch and committed to my repository. http://github.com/ngoto/bioruby/commit/6299d291b925442d828ff2a95c4526c45dc62208 It will soon be merged to the main bioruby git repo. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 30 Sep 2008 14:21:12 +1000 "Ben Woodcroft" wrote: > Hi, > > So I was trying to parse a uniprot file, and I found that bioruby threw an > error when asked it to return a DR key that didn't exist in the uniprot file > (in particular, GO annotations when none were defined). > > I made a branch that fixes this by returning [] in that situation, and added > a test for it as well: > http://github.com/wwood/bioruby/tree/sptr_fix > > If this code is good enough then can I request it be merged into the tree? > > Thanks, > ben > > -- > FYI: My email addresses at unimelb, uq and gmail all redirect to the same > place. 
> _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From donttrustben at gmail.com Tue Sep 30 19:09:06 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Wed, 1 Oct 2008 09:09:06 +1000 Subject: [BioRuby] Bio::SPTR bug and fix In-Reply-To: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp> References: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Thanks. You are the first person to call me Dr. Ben Woodcroft - while I don't mind the sound of that I'm still a first year PhD student. ben 2008/9/30 Naohisa GOTO > Thank you. > > I modified your patch and committed to my repository. > > > http://github.com/ngoto/bioruby/commit/6299d291b925442d828ff2a95c4526c45dc62208 > > It will soon be merged to the main bioruby git repo. > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > On Tue, 30 Sep 2008 14:21:12 +1000 > "Ben Woodcroft" wrote: > > > Hi, > > > > So I was trying to parse a uniprot file, and I found that bioruby threw > an > > error when asked it to return a DR key that didn't exist in the uniprot > file > > (in particular, GO annotations when none were defined). > > > > I made a branch that fixes this by returning [] in that situation, and > added > > a test for it as well: > > http://github.com/wwood/bioruby/tree/sptr_fix > > > > If this code is good enough then can I request it be merged into the > tree? > > > > Thanks, > > ben > > > > -- > > FYI: My email addresses at unimelb, uq and gmail all redirect to the same > > place. 
> > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.
From davide.rambaldi at ifom-ieo-campus.it Mon Sep 1 11:44:36 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Mon, 1 Sep 2008 13:44:36 +0200 Subject: [BioRuby] test and bioruby shell questions Message-ID: On Aug 31, 2008, at 6:05 AM, Naohisa GOTO wrote: > Next time, please show all failure message, even if long. Dear Naohisa In attachment you find the complete report on the test in bioruby (I forgot to put in the first mail ... :P ) MyPlatform: bioruby 1.2.1 and ruby 1.8.7 on a Power PC G4 osx 10.4.11 TEST OUTPUT: -------------- next part -------------- Best Regards > Davide Rambaldi, Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From davide.rambaldi at ifom-ieo-campus.it Mon Sep 1 12:18:03 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Mon, 1 Sep 2008 14:18:03 +0200
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To: <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
Message-ID:

On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote:

> bioruby> rm :a
>
> Actually, the rm command temporarily assigns 'nil' to the variable
> because BioRuby shell avoids dumping variables having 'nil' as
> their value.
> (This means, the memory will not be returned to the OS until next GC.)
>
> This implementation looks somewhat ugly, so if you have a better
> idea, please let me know.

Dear Toshiaki,

I have implemented another method (rm2) that extends your rm command
using a case statement and === (case equality).

In practice, if it finds a String or a Symbol it just calls rm(name),
while if it finds an Array, it iterates over the Array and calls rm(e)
on each element:

  def rm2(name)
    # dispatch on the class of the argument
    case name
    when String, Symbol then rm(name)
    when Array then name.each do |e| rm(e) end
    end
  end

This allows commands like:

  rm2(list=ls())   <-- R console style! :P

It puts the list of current objects into an Array named list, then
removes them all! Obviously it is inspired by the R console :P (which
has the same command).

I have put it directly into the bin/bioruby.rb file to test, and it
seems to work... Tell me if it is useful!
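For readers who want to try the dispatch outside the BioRuby shell, here is a minimal sketch under stated assumptions: STORE and the stub rm below are hypothetical stand-ins for the shell's variable handling, not BioRuby code. The sketch uses `then` because the `when X : expr` shorthand is only accepted by Ruby 1.8.

```ruby
# Hypothetical stand-in for the shell's variables; illustration only.
STORE = { :a => 1, :b => 2, :c => 3 }

# Hypothetical stub of rm: delete one named entry from STORE.
# (The real shell command works on shell variables instead.)
def rm(name)
  STORE.delete(name.to_sym)
end

# rm2 as described in the message: dispatch on the argument's class
# through case/when, which compares with === (case equality).
def rm2(name)
  case name
  when String, Symbol then rm(name)
  when Array          then name.each { |e| rm(e) }
  end
end

rm2(:a)            # a single Symbol
rm2(["b", :c])     # an Array mixing Strings and Symbols
p STORE            # prints {}
```

Arguments of any other class fall through the case statement and are silently ignored, which matches the method as posted; raising an ArgumentError in an else branch would be a stricter alternative.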
And don't hesitate to add it to the current code if you think it is a
good idea.

Cheers and best regards!

Davide Rambaldi,
Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group
IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy
[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From donttrustben at gmail.com Mon Sep 1 13:01:47 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Mon, 1 Sep 2008 23:01:47 +1000
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To: <20080901104407.6D8821CBC40F@idnmail.gen-info.osaka-u.ac.jp>
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <20080831042546.D246F1CBC56E@idnmail.gen-info.osaka-u.ac.jp> <20080901104407.6D8821CBC40F@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi,

Your commits seem to fix things, as I now only get the first 3 failures.

Thanks again,
ben

$ ruby runner.rb
Loaded suite .
Started .....FF..F........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ Finished in 141.782259 seconds. 
1) Failure: test_gff_exportview(Bio::FuncTestEnsemblHuman) [./functional/bio/io/test_ensembl.rb:95]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 2) Failure: test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./functional/bio/io/test_ensembl.rb:121]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 3) Failure: test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./functional/bio/io/test_ensembl.rb:180]: <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097\tKNOWN_protein_coding\n"> expected but was <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n">. 1906 tests, 4111 assertions, 3 failures, 0 errors 2008/9/1 Naohisa GOTO > Hi Ben, > > The failures 4) to 7) may be caused by the conflicts of test class names. > I changed test class names to fix this. > (commits 536cdf903a3c3908c117efd554d33117d91452f4 and > 0fe1e7d3ed02185632f4a34d8efe1f21f755b289). > > Current HEAD is: > > http://github.com/bioruby/bioruby/commit/0fe1e7d3ed02185632f4a34d8efe1f21f755b289 > > Note that the first three failures are still unfixed. > > Could you please try again? > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > On Sun, 31 Aug 2008 17:11:31 +1000 > "Ben Woodcroft" wrote: > > > Hi, > > > > Thanks for your concern. 
> > After pulling from the newest github - > > > http://github.com/bioruby/bioruby/commit/e86f8d757c45805389e154f06ccde5a3d9e8a557 > > > > $ ruby -v > > ruby 1.8.6 (2007-09-24 patchlevel 111) [i486-linux] > > $ uname -a > > Linux uyen 2.6.24-21-generic #1 SMP Mon Aug 25 17:32:09 UTC 2008 i686 > GNU/Linux > > > > Using Ubuntu Hardy, and the latest patched version of the ruby1.8 > > package (1.8.6.111-2ubuntu1.1) > > > > $ ruby runner.rb > > Loaded suite . > > Started > > > .....FF..F.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................. > .! 
> > > ..............................................................................................................................................................................................................................................................................................F............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................FFE................................................ > > Finished in 142.816902 seconds. > > > > 1) Failure: > > test_gff_exportview(Bio::FuncTestEnsemblHuman) > > [./functional/bio/io/test_ensembl.rb:95]: > > <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158; > > transcript_id=ENST00000382964; exon_id=ENSE00001494097; > > gene_type=KNOWN_protein_coding\n"> expected but was > > <"">. > > > > 2) Failure: > > test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman) > > [./functional/bio/io/test_ensembl.rb:121]: > > <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tgene_id=ENSG00000206158; > > transcript_id=ENST00000382964; exon_id=ENSE00001494097; > > gene_type=KNOWN_protein_coding\n"> expected but was > > <"">. 
> > > > 3) Failure: > > test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman) > > [./functional/bio/io/test_ensembl.rb:180]: > > > <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097\tKNOWN_protein_coding\n"> > > expected but was > > > <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id\ttranscript_id\texon_id\tgene_type\n">. > > > > 4) Failure: > > test_id_line_sequence_version(Bio::TestEMBL) > > [./unit/bio/db/embl/test_embl_rel89.rb:45]: > > <"1"> expected but was > > . > > > > 5) Failure: > > test_left_padding(Bio::TestStringFormatting) > > [./unit/bio/util/restriction_enzyme/test_string_formatting.rb:43]: > > <"nnnnnnn"> expected but was > > <"">. > > > > 6) Failure: > > test_right_padding(Bio::TestStringFormatting) > > [./unit/bio/util/restriction_enzyme/test_string_formatting.rb:50]: > > <"nn"> expected but was > > <"">. > > > > 7) Error: > > test_strip_padding(Bio::TestStringFormatting): > > NoMethodError: undefined method `[]' for nil:NilClass > > ../lib/bio/util/restriction_enzyme/string_formatting.rb:64:in > > `strip_padding' > > ./unit/bio/util/restriction_enzyme/test_string_formatting.rb:33:in > > `test_strip_padding' > > > > 1867 tests, 4049 assertions, 6 failures, 1 errors > > > > I don't actually care personally about these problems, but am glad to > > help out in a general sense. > > > > Thanks, > > ben > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. 
From pjotr2008 at thebird.nl Tue Sep 2 06:50:56 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Tue, 2 Sep 2008 08:50:56 +0200 Subject: [BioRuby] BioRuby standards Message-ID: <20080902065055.GA29634@thebird.nl> Hi everyone, I have been doing some work on microarray support for BioRuby, see http://github.com/pjotrp/bioruby/tree/bioruby-testing-pjotr There are two questions I want to raise about standards, as I see different solutions in the current tree. First is about error handling. Second about caching. 1) Error handling ought to print to stderr, and we need a consistent way of handling them, as well as a more fine grained approach towards warnings, info, debug etc. messages. Can we come up with a standard where a user can set these from outside Bioruby, e.g. through an environment setting. And what classes can we use for consistent messaging. Obviously a standard way for exceptions is part of that. 2) Web based tools often like to cache things on the local file system. I suggest using BIORUBY_CACHE as a standard environment variable. And, perhaps, BIORUBY_CACHE_SIZE, though that would require a module to monitor that. For (1) David Powers came up with a nice approach for the Cfruby project - where modules can override behaviour of the error handling (I wanted that for the Cfenjin application). See http://rubyforge.org/projects/cfruby/ and the source code at: http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb with my usage: http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb In my case I wanted to override the standard single switch for WARN, INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels and TESTING. For BioRuby it is simpler, as we have (perhaps) have no such requirement at the library level. Pj. 
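[Editorial sketch, not part of the original message.] The proposal in point 1 (messages go to stderr, with a level the user can choose from outside the library) could look roughly like the following, using only Ruby's standard Logger class. The module name MessagingSketch and the environment variable BIORUBY_LOG_LEVEL are invented for illustration; nothing with these names exists in BioRuby.

```ruby
require 'logger'

# Hypothetical helper module; nothing like this exists in BioRuby itself.
module MessagingSketch
  LEVELS = {
    'DEBUG' => Logger::DEBUG,
    'INFO'  => Logger::INFO,
    'WARN'  => Logger::WARN,
    'ERROR' => Logger::ERROR
  }.freeze

  # Build a logger that writes to the given IO (stderr by default).
  # BIORUBY_LOG_LEVEL is an invented variable name, used only for illustration.
  # Unknown or missing values fall back to WARN.
  def self.build_logger(io = $stderr, env = ENV)
    logger = Logger.new(io)
    logger.level = LEVELS.fetch(env['BIORUBY_LOG_LEVEL'].to_s.upcase, Logger::WARN)
    logger
  end
end

logger = MessagingSketch.build_logger
logger.warn('skipping malformed record')  # written to stderr at the default level
logger.debug('parser state dump')         # suppressed unless BIORUBY_LOG_LEVEL=DEBUG
```

Passing the IO and the environment in as arguments keeps the helper usable from a calling program (a web server, say) that wants to redirect or silence messages without touching global state.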
From ngoto at gen-info.osaka-u.ac.jp Tue Sep 2 08:47:11 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 2 Sep 2008 17:47:11 +0900 Subject: [BioRuby] BioRuby standards In-Reply-To: <20080902065055.GA29634@thebird.nl> References: <20080902065055.GA29634@thebird.nl> Message-ID: <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> Hi, On Tue, 2 Sep 2008 08:50:56 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi everyone, > > I have been doing some work on microarray support for BioRuby, see > > http://github.com/pjotrp/bioruby/tree/bioruby-testing-pjotr > > There are two questions I want to raise about standards, as I see > different solutions in the current tree. First is about error > handling. Second about caching. > > 1) Error handling ought to print to stderr, and we need a consistent > way of handling them, as well as a more fine grained approach towards > warnings, info, debug etc. messages. Can we come up with a standard > where a user can set these from outside Bioruby, e.g. through an > environment setting. And what classes can we use for consistent > messaging. Obviously a standard way for exceptions is part of that. As you said, no standards, but, empirically in BioRuby, * Small errors are simply ignored and the program continues. * When normal (but not severe) errors, prints warning messages to $stdout, and continues to process. * When severe error, raises error. > > 2) Web based tools often like to cache things on the local file > system. I suggest using BIORUBY_CACHE as a standard environment > variable. And, perhaps, BIORUBY_CACHE_SIZE, though that would require > a module to monitor that. Because BioRuby is a library (except for BioRuby Shell), it is generally not so good to depend on environment variables. Instead, to prepare APIs to set cache positions and sizes is better. 
Note that some classes use the Tempfile class, a standard class bundled with Ruby by default, and the Tempfile class depends on environment variables (TMPDIR, TMP, etc.). I think cache isn't suitable for a standard, because its purpose may differ from program (or class, module, etc.) to program. For example, if I want to put class A's cache on a fast hard disk with very large size, and program B's cache on a slower hard disk with small size, what should I do? > For (1) David Powers came up with a nice approach for the Cfruby > project - where modules can override behaviour of the error handling > (I wanted that for the Cfenjin application). See > > http://rubyforge.org/projects/cfruby/ > > and the source code at: > > http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb > > with my usage: > > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb > > In my case I wanted to override the standard single switch for WARN, > INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels > and TESTING. For BioRuby it is simpler, as we have (perhaps) have no > such requirement at the library level. I've not seen this yet, but is it different from the Logger class, a standard class bundled with Ruby? http://www.ruby-doc.org/stdlib/libdoc/logger/rdoc/classes/Logger.html Thanks, -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr2008 at thebird.nl Tue Sep 2 09:19:58 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Tue, 2 Sep 2008 11:19:58 +0200 Subject: [BioRuby] BioRuby standards In-Reply-To: <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080902091958.GA31400@thebird.nl> Hi Naohisa, Thanks for your reply. Some comments.
On Tue, Sep 02, 2008 at 05:47:11PM +0900, Naohisa GOTO wrote: > As you said, no standards, but, empirically in BioRuby, > > * Small errors are simply ignored and the program continues. > * When normal (but not severe) errors, prints warning messages > to $stdout, and continues to process. > * When severe error, raises error. This is fine for an interactive program - like the shell. But it is not such a good strategy for software calling into Bioruby (think of a web server). I am unhappy with this state of things. Can we come up with something better? I think in the long term this will help predictability of BioRuby. > Because BioRuby is a library (except for BioRuby Shell), > it is generally not so good to depend on environment variables. Fair enough. > Instead, to prepare APIs to set cache positions and sizes > is better. That would be cool. That API could take care of environment options too, if we were ever to introduce them. > Note that some classes use Tempfile class, a standard bundled > class with Ruby by default, and the Tempfile class depends > on enviroment variables (TMPDIR, TMP, etc.). I noticed. Caching is a bit different in nature - as caches may be there for a long time. TMPDIRs get emptied on reboot, for one. > I think cache isn't suitable for standard, because its purpose > may differ from program (or class, module, etc.) to program. > For example, if I want to put class A's cache on a fast hard disk > with very large size, and program B's cache on a slower hard disk > with small size, what should I do? That is true. OK, leave caching for the modules to resolve. I'll use my own caching of GEO XML objects. > > For (1) David Powers came up with a nice approach for the Cfruby > > project - where modules can override behaviour of the error handling > > (I wanted that for the Cfenjin application). 
See > > > > http://rubyforge.org/projects/cfruby/ > > > > and the source code at: > > > > http://cfruby.rubyforge.org/svn/lib/libcfruby/flowmonitor.rb > > > > with my usage: > > > > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_logger.rb > > http://cfruby.rubyforge.org/svn/lib/libcfenjin/cfp_flowmonitor.rb > > > > In my case I wanted to override the standard single switch for WARN, > > INFO, DEBUG etc., with a second switch for TRACING, VERBOSITY levels > > and TESTING. For BioRuby it is simpler, as we have (perhaps) have no > > such requirement at the library level. > > I've not seen this yet, but is it different from the Logger class, > a standard bundled class with Ruby? > http://www.ruby-doc.org/stdlib/libdoc/logger/rdoc/classes/Logger.html The difference is that David's version makes use of an observer pattern to allow overriding and enhancing. This allows a program to change behaviour of all (internal) library error handling in a transparent fashion. Ignore it, it is over the top for BioRuby. Note: using the logger class consistently would already be a great improvement. Pj. From davide.rambaldi at ifom-ieo-campus.it Tue Sep 2 10:28:59 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Tue, 2 Sep 2008 12:28:59 +0200 Subject: [BioRuby] Bio::Blat Message-ID: Hi all, I am trying to use Ruby and BioRuby to translate a Perl script that I am using in my lab to parse psl files. The blatanalyzer script should: sort entries according to identity, coverage, score, cut psl files in order to keep only alignments with a given identity, generate report tables (similar to a web blat result table in the UCSC server), convert psl to gff and gtf, etc... 
USAGE:

Usage ./blatanalyzer.rb [options] action file.psl

and it can also be used in a pipe (cat file.psl | ./blatanalyzer.rb action)

I am a newbie at Ruby scripting (and I am also currently trying to understand the conventions used in BioRuby), so I am not sure if my design is decent or completely stupid/crazy.

First of all, I need some extra methods not present in Bio::Blat::Report (like coverage, sorting_by, grouping, etc...), so my idea is to make a subclass of Bio::Blat::Report:

module Bio
  class Blat
    class Analyzer < Report
      def coverage
        # implementation here ...
      end
    end
  end
end

Is this a good idea?

On the other side, I am working on a Bio::Blat::Application that should initialize options (parsed by an OptParser class), load a stream, pass the stream to the Bio::Blat::Analyzer object, and choose which method (action) to apply to the stream.

Is it OK to put this code in the Bio::Blat namespace, or should I put it in an external Application class?

Currently the structure of my blatanalyzer.rb application is this:

require 'bio'

class Color
  # to handle colorized output (use term-ansicolor)
end

class OptParser
  # parse command line options
end

module Bio
  class Blat

    class Analyzer < Report
      # extend the functionality of the Report with sorting, grouping and other methods
    end

    class Application
      # load a stream, check options, select action and execute it printing result on STDOUT
    end

  end
end

# MAIN.APP
# slurp command line options and start application
options = OptParser.parse(ARGV)
Bio::Blat::Application.new(options,ARGF)

Is there something I need to change? Does it make sense?

Thanks for your help, any suggestion is really welcome!

Davide Rambaldi, Bioinformatics PhD student.
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From jan.aerts at gmail.com Tue Sep 2 15:46:43 2008 From: jan.aerts at gmail.com (Jan Aerts) Date: Tue, 2 Sep 2008 16:46:43 +0100 Subject: [BioRuby] official announcement move of bioruby from CVS to git Message-ID: <4c7507a70809020846w2fe0cbe4m17545950c0e33b42@mail.gmail.com> All, We can finally tell you that bioruby has officially moved from CVS to git. Development on CVS will be discontinued. Please use the git repository at http://github.com/bioruby/bioruby from now on. How do I get bioruby? ================= Nothing has changed for how to obtain bioruby. At least for the released versions. You can still do a "gem install bio" to get the latest release as we will continue to make the gem available through rubyforge. Alternatively, it will become possible (but not yet) to install the gem directly from github: "gem sources -a http://gems.github.com" followed by a "gem install bioruby-bio". The story is different if you want to get the latest development version. Instead of doing a 'cvs checkout' or 'cvs export' as you used to do, you can clone the online git repository with "git clone git://github.com/bioruby/bioruby.git". The 'cvs update' you used to do should now be changed to a "git pull". How do I contribute to bioruby? ======================== Contributing to bioruby should be much easier with git than it was with CVS. See this blog post (http://saaientist.blogspot.com/2008/06/bioruby-with-git-how-would-that-work.html) for guidelines. Basically, you clone the repository locally and send a patch or a pull request. 
Moreover, if you use the 'fork' button on the github website, your clone will be on the github system as well and your development can be followed by everyone (see http://github.com/bioruby/bioruby/network), which is a Good Thing(TM). For a guideline on how to format your commit messages nicely, see here: http://www.tpope.net/node/106 Thanks to everyone who cloned the repository and started developing. Keep up the good work. jan. (also for Naohisa Goto) From pjotr2008 at thebird.nl Wed Sep 3 08:07:22 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 3 Sep 2008 10:07:22 +0200 Subject: [BioRuby] official announcement move of bioruby from CVS to git Message-ID: <20080903080722.GB9055@thebird.nl> > We can finally tell you that bioruby has officially moved from CVS to > git. Development on CVS will be discontinued. Please use the git > repository at http://github.com/bioruby/bioruby from now on. This is great! I must say, the more I use git, the more I like it. This is the version control system I have always wanted (after darcs and Mercurial). It is a tad complex when using more advanced features, but once they work they are stunningly good. And github is also an astounding tool (much of it Ruby based, I gather). Every bioinformatician should make git part of his/her toolbox. Really. Pj. From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 09:45:05 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Wed, 3 Sep 2008 11:45:05 +0200 Subject: [BioRuby] Bio::Blat::Report Message-ID: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> Hi, after installing the last version from git (http://github.com/ bioruby/bioruby), I have a couple of warnings using my application: NOTE: the file test.psl I am using for testing is without psl headers Oni:~/code/Ruby/bioruby tucano$ ./blatanalyzer list blatanalyzerdir/ test/test.psl /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:81: warning: private attribute? 
/usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:84: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:87: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:90: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:93: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:96: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:270: warning: private attribute? /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/report.rb:89: warning: instance variable @header_lines not initialized The previuos version I was using don't give warnings here the diff of changes in the new git version and in the previous report.rb: diff oldreport.rb /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/ report.rb 48a49,51 > # Splitter for Bio::FlatFile > FLATFILE_SPLITTER = Bio::FlatFile::Splitter::LineOriented > 53c56 < def initialize(text) --- > def initialize(text = '') 57c60 < text.each do |line| --- > text.each_line do |line| 74c77,115 < @columns = parse_header(head) --- > @columns = parse_header(head) unless head.empty? > end > > # Adds a header line if the header data is not yet given and > # the given line is suitable for header. > # Returns self if adding header line is succeeded. > # Otherwise, returns false (the line is not added). > def add_header_line(line) > return false if defined? @columns > line = line.chomp > case line > when /^\d/ > @columns = defined? @header_lines ? parse_header (@header_lines) : [] > return false > when /\A\-+\s*\z/ > @columns = defined? @header_lines ? parse_header (@header_lines) : [] > return self > else > @header_lines ||= [] > @header_lines.push line > end > end > > # Adds a line to the entry if the given line is regarded as > # a part of the current entry. 
> # If the current entry (self) is empty, or the line has the same > # query name, the line is added and returns self. > # Otherwise, returns false (the line is not added). > def add_line(line) > if /\A\s*\z/ =~ line then > return @hits.empty? ? self : false > end > hit = Hit.new(line.chomp) > if @hits.empty? or @hits.first.query.name == hit.query.name then > @hits.push hit > return self > else > return false > end Best Regards Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From mail at michaelbarton.me.uk Wed Sep 3 12:27:23 2008 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 3 Sep 2008 13:27:23 +0100 Subject: [BioRuby] official announcement move of bioruby from CVS to git In-Reply-To: <20080903080722.GB9055@thebird.nl> References: <20080903080722.GB9055@thebird.nl> Message-ID: I completely agree with what Pjotr has written. I think moving to git/github is a great step for BioRuby, and I hope to see it pay dividends in the future for the development of the code base. This is great work by everyone involved to move BioRuby over to git. Mike On Wed, Sep 3, 2008 at 9:07 AM, Pjotr Prins wrote: > > We can finally tell you that bioruby has officially moved from CVS to > > git. Development on CVS will be discontinued. Please use the git > > repository at http://github.com/bioruby/bioruby from now on. > > This is great! I must say, the more I use git, the more I like it. > This is the version control system I have always wanted (after darcs > and Mercurial). It is a tad complex when using more advanced features, > but once they work they are stunningly good.
And github is also an > astounding tool (much of it Ruby based, I gather). > > Every bioinformatician should make git part of his/her toolbox. > Really. > > Pj. > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From mail at michaelbarton.me.uk Wed Sep 3 12:32:56 2008 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 3 Sep 2008 13:32:56 +0100 Subject: [BioRuby] Ruby in the minority for bioinformaticians. Message-ID: I recently ran a survey of bioinformaticians which included which programming language do you use. The results will be somewhat biased to people who read blogs etcetera, but does show that Ruby has had a somewhat small uptake in the bioinformatics community. The results can be found here (loads slowly at the moment). http://openwetware.org/wiki/Biogang:Projects/Bioinformatics_Career_Survey_2008_Results Mike From ngoto at gen-info.osaka-u.ac.jp Wed Sep 3 13:34:28 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 3 Sep 2008 22:34:28 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> Message-ID: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Hi, Thanks for reporting bugs. On Wed, 3 Sep 2008 11:45:05 +0200 Davide Rambaldi wrote: > Hi, after installing the last version from git (http://github.com/ > bioruby/bioruby), I have a couple of warnings using my application: > > NOTE: the file test.psl I am using for testing is without psl headers > > Oni:~/code/Ruby/bioruby tucano$ ./blatanalyzer list blatanalyzerdir/ > test/test.psl > /usr/local/lib/ruby/site_ruby/1.8/bio/io/flatfile/splitter.rb:81: > warning: private attribute? 
(snip) > /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/report.rb:89: > warning: instance variable @header_lines not initialized The warning "warning: instance variable @header_lines not initialized" was caused by a bug in header parsing. The "warning: private attribute?" messages are harmless, but I've changed the code so that they are no longer shown, by explicitly declaring the private attributes with "private". I've just fixed both in the git repository. http://github.com/bioruby/bioruby/commit/3ff940988b76bdff75679cdf0af4c836f76fa3a1 http://github.com/bioruby/bioruby/commit/1440b766202a2b66ac7386b9b46928834a9c9873 Could you please try again with the new version? FYI: When reporting a bug, please state which Ruby version, OS, and architecture (type of CPU) you are using, together with the BioRuby version. In addition, please include a short script and test data that reproduce the bug, or all your scripts and data (if they are very large, put them on your homepage or blog). Note that in this case I could find the problem without this information, so you don't need to send it unless the bugs turn out not to be fixed properly. > The previuos version I was using don't give warnings > > here the diff of changes in the new git version and in the previous > report.rb: > > diff oldreport.rb /usr/local/lib/ruby/site_ruby/1.8/bio/appl/blat/ > report.rb Please don't show diffs between already committed versions, except when you can clearly point out what is wrong. Normally, to see diffs with commit messages, running % git log -p lib/bio/appl/blat/report.rb is enough.
Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 14:30:42 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Wed, 3 Sep 2008 16:30:42 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <8B0629C9-0DC0-4DBD-BC46-CB2A5D7BF1FE@ifom-ieo-campus.it> > > Could you please try again with new version? > I've just fixed them in git repository. > http://github.com/bioruby/bioruby/commit/ > 3ff940988b76bdff75679cdf0af4c836f76fa3a1 > http://github.com/bioruby/bioruby/commit/ > 1440b766202a2b66ac7386b9b46928834a9c9873 > It's ok now. Thanks I still have 3 errors in testing the last version (just to report it...) 1) Failure: test_gff_exportview(Bio::FuncTestEnsemblHuman) [./test/functional/bio/ io/test_ensembl.rb:95]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1 \tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 2) Failure: test_gff_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./ test/functional/bio/io/test_ensembl.rb:121]: <"4\tEnsembl\tGene\t1148366\t1151952\t.\t+\t1 \tgene_id=ENSG00000206158; transcript_id=ENST00000382964; exon_id=ENSE00001494097; gene_type=KNOWN_protein_coding\n"> expected but was <"">. 3) Failure: test_tab_exportview_with_named_args(Bio::FuncTestEnsemblHuman) [./ test/functional/bio/io/test_ensembl.rb:180]: <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id \ttranscript_id\texon_id\tgene_type\n4\tEnsembl\tGene\t1148366 \t1151952\t.\t+\t1\tENSG00000206158\tENST00000382964\tENSE00001494097 \tKNOWN_protein_coding\n"> expected but was <"seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tgene_id \ttranscript_id\texon_id\tgene_type\n">. 
Thanks again > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Wed Sep 3 14:31:43 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 3 Sep 2008 23:31:43 +0900 Subject: [BioRuby] Bio::Blat In-Reply-To: References: Message-ID: <20080903143144.3EE481CBC4CB@idnmail.gen-info.osaka-u.ac.jp> On Tue, 2 Sep 2008 12:28:59 +0200 Davide Rambaldi wrote: > Hi all, I am trying to use Ruby and BioRuby to translate a Perl > script that I am using in my lab to parse psl files. > > The blatanalyzer script should: > > sort entries according to identity, coverage, score, cut psl files in > order to keep only alignments with a given identity, > generate report tables (similar to a web blat result table in the > UCSC server), convert psl to gff and gtf, etc... > > USAGE: > > Usage ./blatanalyzer.rb [options] action file.psl > > and can be used also in a pipe (cat file.psl | ./blatanalyzer.rb action) > > > I am a newbye of Ruby scripting (and also I am currently trying to > understand the conventions used in BioRuby) so I am not sure if my > design is decent or completely stupid/crazy. > > First of all, I need some extra methods not present in > Bio::Blat::Report (like coverage, sorting_by, grouping, etc...) 
so > my idea is to made a subclass of Bio::Blat::Report: > > module Bio > class Blat > class Analyzer < Report > def coverage > implementation here ... > end > end > end > end > > Is this a good idea? In Ruby, a class that inherits from an existing class can be affected by internal changes to that class, including conflicts between private method names and instance variable names. If you can track changes in the parent class and adapt your code to them, creating a subclass may be the most efficient way in terms of running speed, memory efficiency, and code size. If you don't want to do that, and/or the internal structure of the parent class isn't clear, it is safer to hold the report as an object inside your own class, without inheritance. Note that this is only a practical point of view, as I don't know much about the philosophy of OOP. > On the other side I am working on a Bio::Blat::Application that > should initialize options (parsed by a OptParser class), load a > stream, pass the stream to the Bio::Blat::Analyzer object, choose > which method (action) apply to the stream. > > Is OK to put this code in the Bio::Blat namespace? or I should put it > in an external Application class? In your application, you can do whatever you like. However, I think using your own namespace would be better to avoid confusion, especially when errors occur. In addition, be careful when using the mod_ruby Apache module. Because mod_ruby shares Ruby interpreters among different scripts, modifying an existing class/module under mod_ruby is not recommended unless you understand what you are doing.
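[Editorial sketch, not part of the original message.] The alternative to subclassing mentioned earlier in this reply (keeping the report as an object rather than inheriting from it) might be sketched as below. BlatAnalyzerSketch, FakeHit and FakeReport are hypothetical names; the two Struct stand-ins exist only so the example runs without BioRuby installed, and the wrapper assumes nothing about the report beyond a public hits method.

```ruby
require 'forwardable'

# Hypothetical wrapper: it holds a report object instead of subclassing it,
# so it depends only on the report's public interface.
class BlatAnalyzerSketch
  extend Forwardable
  def_delegators :@report, :hits   # forward only the public methods we need

  def initialize(report)
    @report = report
  end

  # Hits ordered by a caller-supplied numeric key, highest first.
  def sorted_hits
    hits.sort_by { |hit| -yield(hit) }
  end
end

# Stand-ins for a parsed report and its hits (assumptions for illustration).
FakeHit    = Struct.new(:name, :score)
FakeReport = Struct.new(:hits)

report   = FakeReport.new([FakeHit.new('a', 10), FakeHit.new('b', 42)])
analyzer = BlatAnalyzerSketch.new(report)
puts analyzer.sorted_hits { |h| h.score }.map(&:name).join(', ')
```

Because the wrapper never touches the report's instance variables or private methods, internal changes in the wrapped class should not break it as long as the public methods it forwards keep working.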
> > Actually the structure of my blatanalyzer.rb application is this one > > class Color > # to handle colorized output (use term-ansicolor) > end > > class OptParser > # parse command line options > end > > module Bio > class Blat > > class Analyzer < Report > # extend the functionality of the Report with sorting, > grouping and other methods > end > > class Application > # load a stream, check options, select action and execute it > printing result on STDOUT > end > > end > end > > # MAIN.APP > # slurp command line options and start application > options = OptParser.parse(ARGV) > Bio::Blat::Application.new(options,ARGF) > > > Something I need to change? make sense? In your application, you can do whatever you want to do. What I write here is only an empirical suggestion. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Wed Sep 3 15:48:07 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Wed, 3 Sep 2008 17:48:07 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> Hi again sorry for all this e-mails, I notice a change in the reporter object (add_line method) after commit: http://github.com/bioruby/bioruby/commit/ 88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd + # Adds a line to the entry if the given line is regarded as + # a part of the current entry. + # If the current entry (self) is empty, or the line has the same + # query name, the line is added and returns self. + # Otherwise, returns false (the line is not added). + def add_line(line) + if /\A\s*\z/ =~ line then + return @hits.empty? ? self : false + end + hit = Hit.new(line.chomp) + if @hits.empty? 
or @hits.first.query.name == hit.query.name then + @hits.push hit + return self + else + return false + end end So now if there are more than one query_id in the input file it will be automatically splitted in different reports right? That's cool (I have developed a method in my blat analyzer to group hits by id that I can remove now). the only point I see: what append with an input with line swapped? I don't believe is a common case anyway: blat psl results are ordered by query name but can happend if you change the order of psl lines. consider this script: #!/usr/local/bin/ruby -w require 'bio' Bio::FlatFile.open(Bio::Blat::Report,ARGF).each do |report| puts "object id: " + report.object_id.to_s + " hits: " + report.hits.size.to_s + " query name:" + report.query_id end Before the commit it give only one object, and (as reported in doc) only the first query name. now with this test file: -------------- next part -------------- 3 lines of psl output with 3 different query name: output: object id: 277400 hits: 1 query name:query1 object id: 274620 hits: 1 query name:query2 object id: 271910 hits: 1 query name:query3 But if with a psl file like this one: -------------- next part -------------- Where we have 3 query names (2 hits each) and lines are not in order: object id: 277400 hits: 1 query name:query1 object id: 274620 hits: 1 query name:query2 object id: 272010 hits: 1 query name:query1 object id: 269350 hits: 1 query name:query3 object id: 266640 hits: 1 query name:query2 object id: 263930 hits: 1 query name:query3 f I sort the lines again by query name: -------------- next part -------------- object id: 277400 hits: 2 query name:query1 object id: 273590 hits: 2 query name:query2 object id: 269800 hits: 2 query name:query3 So it doesn't work if you have unsorted lines (but I guess is faster). Sorry for my bad english and for this long mail. best regards Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 03:52:56 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 12:52:56 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> Message-ID: <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> On Wed, 3 Sep 2008 17:48:07 +0200 Davide Rambaldi wrote: > Hi again sorry for all this e-mails, > > I notice a change in the reporter object (add_line method) after commit: > http://github.com/bioruby/bioruby/commit/ > 88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd > > + # Adds a line to the entry if the given line is regarded as > + # a part of the current entry. > + # If the current entry (self) is empty, or the line has the same > + # query name, the line is added and returns self. > + # Otherwise, returns false (the line is not added). > + def add_line(line) > + if /\A\s*\z/ =~ line then > + return @hits.empty? ? self : false > + end > + hit = Hit.new(line.chomp) > + if @hits.empty? or @hits.first.query.name == hit.query.name > then > + @hits.push hit > + return self > + else > + return false > + end > end > > > So now if there are more than one query_id in the input file it will > be automatically splitted in different reports right? Yes, in combination with Bio::FlatFile. 
The behavior was changed after this commit:
http://github.com/bioruby/bioruby/commit/88b2fb24dddcd2d5d0715e8274eda1b1ebac0abd

This is somewhat incompatible with the previous behavior, but it is
better for speed and memory usage. In addition, some people requested it:
http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html
(Mailing list written in Japanese)

Note that this can group hits incorrectly for data that contiguously
contains different query sequences with the same name.

> That's cool (I have developed a method in my blat analyzer to group
> hits by id that I can remove now).
>
> the only point I see: what append with an input with line swapped?
> I don't believe is a common case anyway: blat psl results are ordered
> by query name
> but can happend if you change the order of psl lines.

When the parser detects a change of query name, it starts a new report
object. Note that the Bio::Blat::Report parser only supports files
directly generated by the blat program, without post-modification.
Whatever happens with modified data is at your own risk.

> consider this script:
>
> #!/usr/local/bin/ruby -w
> require 'bio'
>
> Bio::FlatFile.open(Bio::Blat::Report,ARGF).each do |report|
>   puts "object id: " + report.object_id.to_s + " hits: " +
> report.hits.size.to_s + " query name:" + report.query_id
> end
>
> Before the commit it give only one object, and (as reported in doc)
> only the first query name.
>
> now with this test file:

If you really want the old behavior,

  str = File.read(filename)
  obj = Bio::Blat::Report.new(str)

Then obj is a single Bio::Blat::Report object, possibly containing
multiple queries.
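
To make the splitting rule concrete, here is a minimal sketch that
imitates the add_line protocol quoted above without using BioRuby.
MiniReport and split_reports are hypothetical names, and the input
lines are simplified to "query_name<TAB>score" instead of real PSL
columns, so this only illustrates the idea and is not the actual
Bio::FlatFile code.

```ruby
# Imitates the add_line protocol: a report accepts a line only while
# the query name stays the same as that of its first hit.
class MiniReport
  attr_reader :hits

  def initialize
    @hits = []
  end

  def query_name
    @hits.first && @hits.first[0]
  end

  # Returns self if the line was added, false if it starts a new entry.
  def add_line(line)
    name, score = line.chomp.split("\t")
    if @hits.empty? || query_name == name
      @hits.push([name, score.to_i])
      self
    else
      false
    end
  end
end

# Feeds lines to the current report and opens a new one on refusal.
def split_reports(lines)
  reports = []
  current = MiniReport.new
  lines.each do |line|
    next if line.strip.empty?
    next if current.add_line(line)
    reports << current
    current = MiniReport.new
    current.add_line(line)
  end
  reports << current unless current.hits.empty?
  reports
end
```

With six lines in the order query1, query2, query1, query3, query2,
query3 this yields six one-hit reports; after sorting the lines by
query name it yields three reports with two hits each, which matches
the outputs reported earlier in this thread.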
-- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Thu Sep 4 09:11:54 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Thu, 4 Sep 2008 11:11:54 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > This is somehow incompatible, but good at speed and memory usage. > In addition, some people requested. > http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html > (Mailing list written in Japanese) ehm... any good translator from japanese to english (or better italian!) ? :P anyway I am agree that the strange case of mixed hits can be ignored. This commits will be available in the next version of bioruby? I have bioruby on the edge in my laptop but not on the cluster... Last question (sorry for asking everything), there is a way to generate docs of boiruby that can be queried with the ri command? ri Bio::Blat::Report Nothing known about Bio::Blat::Report Thanks! Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 11:36:28 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 20:36:28 +0900 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> Message-ID: <20080904113629.834C61CBC5D5@idnmail.gen-info.osaka-u.ac.jp> On Thu, 4 Sep 2008 11:11:54 +0200 Davide Rambaldi wrote: > > On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > > > This is somehow incompatible, but good at speed and memory usage. > > In addition, some people requested. > > http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/000137.html > > (Mailing list written in Japanese) > > > ehm... any good translator from japanese to english (or better > italian!) ? :P Google or Yahoo can be used. Be careful they frequently mistranslate. http://www.google.com/translate_t http://babelfish.yahoo.com/ > anyway I am agree that the strange case of mixed hits can be ignored. > > This commits will be available in the next version of bioruby? Yes. > > I have bioruby on the edge in my laptop but not on the cluster... > > Last question (sorry for asking everything), there is a way to > generate docs of boiruby that can be queried with the ri command? 
> > ri Bio::Blat::Report > Nothing known about Bio::Blat::Report I don't know about ri, and I hope someone can answer. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From kpatil at science.uva.nl Thu Sep 4 12:02:19 2008 From: kpatil at science.uva.nl (K. Patil) Date: Thu, 4 Sep 2008 14:02:19 +0200 (CEST) Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> Message-ID: <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> Hi, I am trying to do some simple processing on fasta files. It works file for small files (upto several MB). But as soon as I move to very large files (e.g. 2.2 GB) the program crashes. Any help/suggestions highly appreciated. 
Best regards, Kaustubh Patil I am pasting a very simple example below (the file is 2.2GB); irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") => #, @buffer="", @path="9606.2.fna">, @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", @path="9606.2.fna">, @header=nil, @delimiter="\n>", @delimiter_overrun=1>, @firsttime_flag=true, @stream=#, @buffer="", @path="9606.2.fna">, @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, @dbclass=Bio::FastaFormat> irb(main):022:0> fasta.each do |seq| irb(main):023:1* print seq.data irb(main):024:1> end NoMethodError: private method `sub' called for nil:NilClass from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' from (irb):22 From ngoto at gen-info.osaka-u.ac.jp Thu Sep 4 13:01:59 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 4 Sep 2008 22:01:59 +0900 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> Message-ID: <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> Hi, Please show which BioRuby version, Ruby version, OS, architecture (type of CPU) you are using. Is the Ruby and/or BioRuby version older? 
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 4 Sep 2008 14:02:19 +0200 (CEST) "K. Patil" wrote: > Hi, > > I am trying to do some simple processing on fasta files. It works file for > small files (upto several MB). But as soon as I move to very large files > (e.g. 2.2 GB) the program crashes. Any help/suggestions highly > appreciated. > > Best regards, > Kaustubh Patil > > I am pasting a very simple example below (the file is 2.2GB); > > irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") > => # @splitter=# @stream=# @io=# @io=#, @buffer="", @path="9606.2.fna">, > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > @path="9606.2.fna">, @header=nil, @delimiter="\n>", @delimiter_overrun=1>, > @firsttime_flag=true, > @stream=# @io=# @io=#, @buffer="", @path="9606.2.fna">, > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, > @dbclass=Bio::FastaFormat> > irb(main):022:0> fasta.each do |seq| > irb(main):023:1* print seq.data > irb(main):024:1> end > NoMethodError: private method `sub' called for nil:NilClass > from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' > from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' > from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' > from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' > from (irb):22 > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From kpatil at science.uva.nl Thu Sep 4 13:32:27 2008 From: kpatil at science.uva.nl (K. 
Patil) Date: Thu, 4 Sep 2008 15:32:27 +0200 (CEST) Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl> Oops, sorry for incomplete information. Here it is; Ruby: 1.8 Bioruby: 1.0.0 OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 GNU/Linux Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate permissions (all packages are installed by the administrator on request). thanks and regards, kaustubh > Hi, > > Please show which BioRuby version, Ruby version, OS, > architecture (type of CPU) you are using. > > Is the Ruby and/or BioRuby version older? > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > On Thu, 4 Sep 2008 14:02:19 +0200 (CEST) > "K. Patil" wrote: > >> Hi, >> >> I am trying to do some simple processing on fasta files. It works file >> for >> small files (upto several MB). But as soon as I move to very large files >> (e.g. 2.2 GB) the program crashes. Any help/suggestions highly >> appreciated. 
>> >> Best regards, >> Kaustubh Patil >> >> I am pasting a very simple example below (the file is 2.2GB); >> >> irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") >> => #> @splitter=#> @stream=#> @io=#> @io=#, @buffer="", @path="9606.2.fna">, >> @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", >> @path="9606.2.fna">, @header=nil, @delimiter="\n>", >> @delimiter_overrun=1>, >> @firsttime_flag=true, >> @stream=#> @io=#> @io=#, @buffer="", @path="9606.2.fna">, >> @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", >> @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, >> @dbclass=Bio::FastaFormat> >> irb(main):022:0> fasta.each do |seq| >> irb(main):023:1* print seq.data >> irb(main):024:1> end >> NoMethodError: private method `sub' called for nil:NilClass >> from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' >> from (irb):22 >> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > From sgujja at broad.mit.edu Thu Sep 4 14:53:11 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Thu, 04 Sep 2008 10:53:11 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. Message-ID: <48BFF657.1080302@broad.mit.edu> Hi, I am trying to convert a multi fasta sequence file (nucleotide/protein) to genbank format.Is there a way to do this using Bioruby? Appreciate any input/suggestions. 
Thanks
S

From adamnkraut at gmail.com Thu Sep 4 23:13:25 2008
From: adamnkraut at gmail.com (Adam Kraut)
Date: Thu, 4 Sep 2008 19:13:25 -0400
Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion..
In-Reply-To: <48BFF657.1080302@broad.mit.edu>
References: <48BFF657.1080302@broad.mit.edu>
Message-ID: <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com>

I've never used the genbank format, but in Bioruby you could try:

include Bio

fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read)
fasta.entries.each do |seq|
  puts seq.to_seq.output(:genbank)
end

The only tricky part is perhaps the to_seq call, which gives a
Bio::Sequence object that has the different output format methods.
-Adam

On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote:

> Hi,
>
> I am trying to convert a multi fasta sequence file (nucleotide/protein) to
> genbank format.Is there a way to do this using Bioruby?
>
> Appreciate any input/suggestions.
>
> Thanks
> S

From ngoto at gen-info.osaka-u.ac.jp Fri Sep 5 01:34:26 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 5 Sep 2008 10:34:26 +0900
Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion..
In-Reply-To: <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com>
References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com>
Message-ID: <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Thu, 4 Sep 2008 19:13:25 -0400
"Adam Kraut" wrote:

> I've never used the genbank format, but in Bioruby you could try:
>
> include Bio
>
> fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read)
> fasta.entries.each do |seq|
>   puts seq.to_seq.output(:genbank)
> end

No need to use Bio::Alignment::MultiFastaFormat in this case.
Bio::FlatFile alone can do it.
For example, to read from stdin and output to stdout,

  require 'bio'
  Bio::FlatFile.open($<) do |ff|
    ff.each do |e|
      print e.to_biosequence.output(:genbank)
    end
  end

Note that output(:genbank) is a new feature, available only in the
latest development version in the git repository.
http://github.com/bioruby/bioruby
(i.e. the above examples cannot be run with BioRuby 1.2.1.)

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> The only tricky part is perhaps is the to_seq call for a Bio::Sequence
> object which has different output format methods.
> -Adam
>
> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote:
>
> > Hi,
> >
> > I am trying to convert a multi fasta sequence file (nucleotide/protein) to
> > genbank format.Is there a way to do this using Bioruby?
> >
> > Appreciate any input/suggestions.
> >
> > Thanks
> > S

From ngoto at gen-info.osaka-u.ac.jp Fri Sep 5 01:47:21 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 5 Sep 2008 10:47:21 +0900
Subject: [BioRuby] problem while handling large fasta files
In-Reply-To: <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl>
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl>
Message-ID:
<20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp>

On Thu, 4 Sep 2008 15:32:27 +0200 (CEST)
"K. Patil" wrote:

> Oops, sorry for incomplete information. Here it is;
>
> Ruby: 1.8
> Bioruby: 1.0.0
> OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64
> GNU/Linux

BioRuby 1.0.0 is too old! The only thing I can say is that the problem
may not occur in the latest version of BioRuby (at least in 1.2.1).

> Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate
> permissions (all packages are installed by the administrator on request).

BioRuby (and also Ruby) can be installed in your home directory,
without root (administrator) permission.

The simplest way is:

  % cd somewhere
  % wget http://bioruby.open-bio.org/archive/bioruby-1.2.1.tar.gz
  % tar zxvf bioruby-1.2.1.tar.gz

And then, when running your script,

  % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib example.rb

("/full/path/to/somewhere" is the directory where you extracted the
bioruby archive.)

If you want to use irb,

  % irb -I /full/path/to/somewhere/bioruby-1.2.1/lib -r bio

Alternatively, put

  $LOAD_PATH.unshift("/full/path/to/somewhere/bioruby-1.2.1/lib")

before the require 'bio' in your script.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

>
> thanks and regards,
> kaustubh

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 09:21:02 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 5 Sep 2008 18:21:02 +0900
Subject: [BioRuby] GFF attributes
Message-ID:

Hi,

When extracting attributes from a GFF file, the older implementation
seems to eat the last character before ";".
The current one (downloaded very recently from github) does not split
well, because the regular expression searches for the largest match.
A patch is included, but I am not sure about the specification.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

The specification says:
> From version 2 onwards, the attribute field must have an tag value
> structure following the syntax used within objects in a .ace file,
> flattened onto one line by semicolon separators. Tags must be
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> be quoted with double quotes. Note: all non-printing characters in
> such free text value strings (e.g. newlines, tabs, control
> characters, etc) must be explicitly represented by their C (UNIX)
> style backslash-escaped representation (e.g. newlines as '\n', tabs
> as '\t').

So it seems that, for proper parsing, double-quoted free text should be
detected: a semicolon inside such a quotation is not an attribute
separator, and it need not be preceded by a backslash.

Anyway, the file I am looking at now is not that complex, so I will go
with a quick hack for now.

Best regards,
Tomoaki

the test program
$ cat test-gff.rb
#!/usr/local/bin/ruby
require 'bio'
gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
Bio::GFF.new(gff_str).records.each do |fr|
  p fr
end

output after patch
$ /usr/local/bin/ruby test-gff.rb
#"\"grail3.0116000101\"", "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

output from current
#"\"grail3.0116000101\"; proteinId 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

older output
#"\"grail3.0116000101", "proteinId"=>"63957", "exonNumber"=>"3"}>

diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb  2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/db/gff.rb  2008-09-05 14:56:50.000000000 +0900
@@ -122,7 +122,7 @@
   def parse_attributes(attributes)
     hash = Hash.new
     scanner = StringScanner.new(attributes)
-    while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
+    while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
       key, value = scanner[1].split(' ', 2)
       key.strip!
       value.strip! if value

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 09:25:30 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 5 Sep 2008 18:25:30 +0900
Subject: [BioRuby] Bio::Blat::Report
In-Reply-To: <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it>
Message-ID:

Hi,

> ehm... any good translator from japanese to english (or better
> italian!) ? :P

Here is a translation by the original sender:

-- start of translation
I am Nishiyama at Kanazawa.

When a multifasta file is used as queries, unlike blast,
blat does not output a header, but instead
outputs the query and target id in each line.

Bio::Blat::Report, in accordance with that
behavior, seems to return one entry with many
hits. However, as a user, I do not want to run the search with a
separate file for each query, while I do want the results to be
aggregated for each query, for example when you want the best hit
location for each query.

Although there is no separator in the output of blat, the results
for the same query come consecutively.
Since it would be useful, when processing the output as a FlatFile,
to return a block of lines with the same query name as one "entry",
I made a "flatfile_splitter".
Because each line is parsed to determine the split position,
the return value is an Array of Hit, so that Hit.new
need not be called again. (This makes about a 20% difference in speed.)
When processing a psl file of 100-200 Mbytes, more than several Gbytes of memory were required with a system reading the whole data into a Hash and processing the hits for each query, but with this system much smaller memory is sufficient. What do you think? -- end of translation The remainder are the diff of the source code. Note that the name of class and file are changed to avoid collision and the behavior of the original class is not changed. On 2008/09/04, at 18:11, Davide Rambaldi wrote: > > On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: > >> This is somehow incompatible, but good at speed and memory usage. >> In addition, some people requested. >> http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/ >> 000137.html >> (Mailing list written in Japanese) > > > ehm... any good translator from japanese to english (or better > italian!) ? :P > > anyway I am agree that the strange case of mixed hits can be ignored. > > This commits will be available in the next version of bioruby? > > I have bioruby on the edge in my laptop but not on the cluster... > > Last question (sorry for asking everything), there is a way to > generate docs of boiruby that can be queried with the ri command? > > ri Bio::Blat::Report > Nothing known about Bio::Blat::Report > > > Thanks! > > Davide Rambaldi, > Bioinformatics PhD student. 
> ----------------------------------------------------- > Bioinformatic Group IFOM-IEO Campus > Via Adamello 16, Milano > I-20139 Italy > > [t] +39 02574303 066 > [e] davide.rambaldi at ifom-ieo-campus.it > [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/ > DavideRambaldi (homepage) > [i] http://www.semm.it (PhD school) > [i] http://www.btbs.unimib.it/ (Master) > > ----------------------------------------------------- > > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan From davide.rambaldi at ifom-ieo-campus.it Fri Sep 5 10:47:06 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Fri, 5 Sep 2008 12:47:06 +0200 Subject: [BioRuby] Bio::Blat::Report In-Reply-To: References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> Message-ID: <17FB5E9F-4D32-4F75-89EC-FF1E0BE1A24F@ifom-ieo-campus.it> On Sep 5, 2008, at 11:25 AM, Tomoaki NISHIYAMA wrote: > Hi, > >> ehm... any good translator from japanese to english (or better >> italian!) ? :P > > Here is a translation by the original sender: > dear Nishiyama thanks for translation to follow the discussion: I am agree, the splitter work well and is fast (create an hash can be a problem with big files). I am grouping queries in my script (in bioruby 1.2.1, not the last git release) with group_by and query.name that return an Hash as you say. Also for my sorting operation (sorting by score, coverage, identity, etc...) is better to work in a small array with only the hits related to one query. 
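
As a rough sketch of that "small array per query" idea, the code below
streams over the lines and keeps only the hits of the current query in
memory, then picks the best-scoring hit of each group. It assumes that
the hits of one query are contiguous, as in unmodified blat output, and
it uses simplified "query_name<TAB>score" lines rather than real PSL
columns. The method names each_query_group and best_hits are
hypothetical, not BioRuby API.

```ruby
require 'stringio'

# Yields [query_name, hits] one query at a time. Only the hits of the
# current query are held in memory, never the whole file.
def each_query_group(io)
  return enum_for(__method__, io) unless block_given?
  current_name = nil
  group = []
  io.each_line do |line|
    next if line.strip.empty?
    name, score = line.chomp.split("\t")
    if current_name && name != current_name
      yield current_name, group
      group = []
    end
    current_name = name
    group << [name, Integer(score)]
  end
  yield current_name, group unless group.empty?
end

# Best score per query, computed on one small array at a time.
def best_hits(io)
  each_query_group(io).map do |name, hits|
    [name, hits.max_by { |_, score| score }.last]
  end.to_h
end

# Usage with an in-memory stream. A File object or ARGF could be
# passed the same way, since only each_line is needed.
psl_like = StringIO.new("q1\t10\nq1\t30\nq2\t5\nq3\t7\nq3\t2\n")
p best_hits(psl_like)
```

Sorting by coverage or identity would work the same way: replace the
max_by call with a sort_by over the per-query array.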
Soon I will put somewhere the code for my blatanalyzer.... (ruby version), any suggestion on where to put it? thanks for the kindly translation Davide > -- start of translation > I am Nishiyama at Kanazawa. > > When a multifasta file is used as queries, unlike blast, > blat does not output a header, but instead > outputs the query and target id in each line. > > Bio::Blat::Report, in accordance with that > behavior, seems to return one entry with many > hits. However, as a user, searching with a split file for each query > is undesired, while the results is desired to be aggregated for > each query. > For example when you want the best hit location for each query. > > Although, there is no separator in the output of blat, the result > for the same query comes continuously. > When processing as a FlatFile, it would be useful > to return a block with the same query name as an "entry", > I made "flatfile_splitter". > Because each line is parsed for determination of split positioin, > return value were made as an Array of Hit, so that Hit.new > need not be called again. (For the speed this would about 20% > difference.) > > When processing a psl file of 100-200 Mbytes, more than several > Gbytes of > memory were required with a system reading the whole data into > a Hash and processing the hits for each query, > but with this system much smaller memory is sufficient. > > What do you think? > > -- end of translation > > The remainder are the diff of the source code. > Note that the name of class and file are changed to avoid collision > and the > behavior of the original class is not changed. > > On 2008/09/04, at 18:11, Davide Rambaldi wrote: > >> >> On Sep 4, 2008, at 5:52 AM, Naohisa GOTO wrote: >> >>> This is somehow incompatible, but good at speed and memory usage. >>> In addition, some people requested. >>> http://lists.open-bio.org/pipermail/bioruby-ja/2007-August/ >>> 000137.html >>> (Mailing list written in Japanese) >> >> >> ehm... 
any good translator from japanese to english (or better >> italian!) ? :P >> >> anyway I am agree that the strange case of mixed hits can be ignored. >> >> This commits will be available in the next version of bioruby? >> >> I have bioruby on the edge in my laptop but not on the cluster... >> >> Last question (sorry for asking everything), there is a way to >> generate docs of boiruby that can be queried with the ri command? >> >> ri Bio::Blat::Report >> Nothing known about Bio::Blat::Report >> >> >> Thanks! >> >> Davide Rambaldi, >> Bioinformatics PhD student. >> ----------------------------------------------------- >> Bioinformatic Group IFOM-IEO Campus >> Via Adamello 16, Milano >> I-20139 Italy >> >> [t] +39 02574303 066 >> [e] davide.rambaldi at ifom-ieo-campus.it >> [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/ >> DavideRambaldi (homepage) >> [i] http://www.semm.it (PhD school) >> [i] http://www.btbs.unimib.it/ (Master) >> >> ----------------------------------------------------- >> >> >> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > Davide Rambaldi, Bioinformatics PhD student. 
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From donttrustben at gmail.com Fri Sep 5 13:12:18 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Fri, 5 Sep 2008 23:12:18 +1000 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl> <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> Message-ID: Or you could use the RUBYLIB environment variable - set it to your bioruby lib/ directory and then you don't have to modify your scripts at all. The advantage of doing this is that your choice of gem/github bioruby version doesn't impact your scripts at all, and so when you change it is much easier. 2008/9/5 Naohisa GOTO > On Thu, 4 Sep 2008 15:32:27 +0200 (CEST) > "K. Patil" wrote: > > > Oops, sorry for incomplete information. Here it is; > > > > Ruby: 1.8 > > Bioruby: 1.0.0 > > OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 > > GNU/Linux > > The BioRuby 1.0.0 is too old! > > The only thing I can say is the problem may not occur > in the latest version of BioRuby, at least 1.2.1. 
> > > Also I cannot upgrade Ruby/Bioruby easily as I don't have appropriate > > permissions (all packages are installed by the administrator on request). > > BioRuby (and also Ruby) can be installed in your home directory, > without root (administrator) permission. > > The simplest way is: > > % cd somewhere > % wget http://bioruby.open-bio.org/archive/bioruby-1.2.1.tar.gz > % tar zxvf bioruby-1.2.1.tar.gz > > And then, when running your script, > > % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib example.rb > (The "/full/path/to/somewhere" is the path you extracted > the bioruby archive.) > > If you want to use irb, > > % ruby -I /full/path/to/somewhere/bioruby-1.2.1/lib -r bio > > Alternatively, put > > $LOAD_PATH.unshift("/full/path/to/somewhere/bioruby-1.2.1/lib") > > before the require 'bio' in your script. > > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > > > thanks and regards, > > kaustubh > > > > > > > Hi, > > > > > > Please show which BioRuby version, Ruby version, OS, > > > architecture (type of CPU) you are using. > > > > > > Is the Ruby and/or BioRuby version older? > > > > > > Naohisa Goto > > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > > > On Thu, 4 Sep 2008 14:02:19 +0200 (CEST) > > > "K. Patil" wrote: > > > > > >> Hi, > > >> > > >> I am trying to do some simple processing on fasta files. It works file > > >> for > > >> small files (upto several MB). But as soon as I move to very large > files > > >> (e.g. 2.2 GB) the program crashes. Any help/suggestions highly > > >> appreciated. 
> > >> > > >> Best regards, > > >> Kaustubh Patil > > >> > > >> I am pasting a very simple example below (the file is 2.2GB); > > >> > > >> irb(main):021:0> fasta = Bio::FastaFormat.open("9606.2.fna") > > >> => # > >> @splitter=# > >> @stream=# > >> @io=# > >> @io=#, @buffer="", @path="9606.2.fna">, > > >> > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > > >> @path="9606.2.fna">, @header=nil, @delimiter="\n>", > > >> @delimiter_overrun=1>, > > >> @firsttime_flag=true, > > >> @stream=# > >> @io=# > >> @io=#, @buffer="", @path="9606.2.fna">, > > >> > @buffer=">9606.2.fna\ntaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac\n", > > >> @path="9606.2.fna">, @skip_leader_mode=:firsttime, @raw=false, > > >> @dbclass=Bio::FastaFormat> > > >> irb(main):022:0> fasta.each do |seq| > > >> irb(main):023:1* print seq.data > > >> irb(main):024:1> end > > >> NoMethodError: private method `sub' called for nil:NilClass > > >> from /usr/lib/ruby/1.8/bio/db/fasta.rb:156:in `initialize' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `new' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:579:in `next_entry' > > >> from /usr/lib/ruby/1.8/bio/io/flatfile.rb:609:in `each' > > >> from (irb):22 > > >> > > >> > > >> _______________________________________________ > > >> BioRuby mailing list > > >> BioRuby at lists.open-bio.org > > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > > > > > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. From sgujja at broad.mit.edu Fri Sep 5 14:22:57 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Fri, 05 Sep 2008 10:22:57 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. 
In-Reply-To: <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <48C140C1.1010905@broad.mit.edu> Hi, Thank you so much for the reply. However, I get the following error on running this code: *require 'bio' Bio::FlatFile.open($<) do |ff| ff.each do |e| print e.to_biosequence.output(:genbank) end end* undefined method `to_biosequence' for # (NoMethodError) And running this code gives me: *include Bio fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) fasta.entries.each do |seq| puts seq.to_seq.output(:genbank) end* uninitialized constant Alignment (NameError)... I guess this is something to do with rubygems. Also, I believe this would generate a genbank file for each sequence in the multi-fasta file. Is there a way to get single Genbank file for the multi-fasta sequence file? Appreciate all the help. Thanks S Naohisa GOTO wrote: > Hi, > > On Thu, 4 Sep 2008 19:13:25 -0400 > "Adam Kraut" wrote: > > >> I've never used the genbank format, but in Bioruby you could try: >> >> include Bio >> >> fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) >> fasta.entries.each do |seq| >> puts seq.to_seq.output(:genbank) >> end >> > > No need to use Bio::Alignment::MultiFastaFormat in this case. > Bio::FlatFile alone can do. > > For example, to read from stdin and output to stdout, > > require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end > > Note that the output(:genbank) are new feature only in > the latest development version in the git repository. > http://github.com/bioruby/bioruby > (i.e. in BioRuby 1.2.1, above examples cannot be run.) 
> > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > >> The only tricky part is perhaps is the to_seq call for a Bio::Sequence >> object which has different output format methods. >> -Adam >> >> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja wrote: >> >> >>> Hi, >>> >>> I am trying to convert a multi fasta sequence file (nucleotide/protein) to >>> genbank format.Is there a way to do this using Bioruby? >>> >>> Appreciate any input/suggestions. >>> >>> Thanks >>> S >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >>> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > > > From adamnkraut at gmail.com Fri Sep 5 15:00:31 2008 From: adamnkraut at gmail.com (Adam Kraut) Date: Fri, 5 Sep 2008 11:00:31 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <48C140C1.1010905@broad.mit.edu> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> <48C140C1.1010905@broad.mit.edu> Message-ID: <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> Naohisa, thanks for clearing that up. I knew there was a better way ;) Sharvari, which version of Bioruby have you installed? Both examples will print everything to stdout, which you can redirect to a single file. On Fri, Sep 5, 2008 at 10:22 AM, Sharvari Gujja wrote: > Hi, > > Thank you so much for the reply. 
> > However, I get the following error on running this code: > > *require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end* > > > undefined method `to_biosequence' for # > (NoMethodError) > > And running this code gives me: > > *include Bio > > fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end* > > uninitialized constant Alignment (NameError)... > > I guess this is something to do with rubygems. > > Also, I believe this would generate a genbank file for each sequence in the > multi-fasta file. Is there a way to get single Genbank file for the > multi-fasta sequence file? > > Appreciate all the help. > > Thanks > S > > > Naohisa GOTO wrote: > >> Hi, >> >> On Thu, 4 Sep 2008 19:13:25 -0400 >> "Adam Kraut" wrote: >> >> >> >>> I've never used the genbank format, but in Bioruby you could try: >>> >>> include Bio >>> >>> fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) >>> fasta.entries.each do |seq| >>> puts seq.to_seq.output(:genbank) >>> end >>> >>> >> >> No need to use Bio::Alignment::MultiFastaFormat in this case. >> Bio::FlatFile alone can do. >> >> For example, to read from stdin and output to stdout, >> >> require 'bio' >> Bio::FlatFile.open($<) do |ff| >> ff.each do |e| >> print e.to_biosequence.output(:genbank) >> end >> end >> >> Note that the output(:genbank) are new feature only in >> the latest development version in the git repository. >> http://github.com/bioruby/bioruby >> (i.e. in BioRuby 1.2.1, above examples cannot be run.) >> >> Naohisa Goto >> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org >> >> >> >>> The only tricky part is perhaps is the to_seq call for a Bio::Sequence >>> object which has different output format methods. 
>>> -Adam >>> >>> On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja >> >wrote: >>> >>> >>> >>>> Hi, >>>> >>>> I am trying to convert a multi fasta sequence file (nucleotide/protein) >>>> to >>>> genbank format.Is there a way to do this using Bioruby? >>>> >>>> Appreciate any input/suggestions. >>>> >>>> Thanks >>>> S >>>> _______________________________________________ >>>> BioRuby mailing list >>>> BioRuby at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioruby >>>> >>>> >>>> >>>> >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >>> >> >> >> >> >> > From sgujja at broad.mit.edu Fri Sep 5 15:11:36 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Fri, 05 Sep 2008 11:11:36 -0400 Subject: [BioRuby] Multi Fasta Sequence file to Genbank conversion.. In-Reply-To: <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> References: <48BFF657.1080302@broad.mit.edu> <134ede0b0809041613o4cd536ddwa183f88f377e108b@mail.gmail.com> <20080905013427.D6BEE1CBC5EF@idnmail.gen-info.osaka-u.ac.jp> <48C140C1.1010905@broad.mit.edu> <134ede0b0809050800h3e3fbd43s9845bb4798bd37df@mail.gmail.com> Message-ID: <48C14C28.2000903@broad.mit.edu> Hi Adam, I am using bioruby version is 1.2.1. How can I upgrade to the new version? Also,the final output file would contain genbank format for each fasta sequence right? I am interested in getting a single genabank file for all the sequences. Thanks S Adam Kraut wrote: > Naohisa, thanks for clearing that up. I knew there was a better way ;) > > Sharvari, which version of Bioruby have you installed? Both examples > will print everything to stdout, which you can redirect to a single file. > > On Fri, Sep 5, 2008 at 10:22 AM, Sharvari Gujja > wrote: > > Hi, > > Thank you so much for the reply. 
> > However, I get the following error on running this code: > > *require 'bio' > > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end* > > > undefined method `to_biosequence' for > # (NoMethodError) > > And running this code gives me: > > *include Bio > > > fasta = Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end* > > uninitialized constant Alignment (NameError)... > > I guess this is something to do with rubygems. > > Also, I believe this would generate a genbank file for each > sequence in the multi-fasta file. Is there a way to get single > Genbank file for the multi-fasta sequence file? > > Appreciate all the help. > > Thanks > S > > > Naohisa GOTO wrote: > > Hi, > > On Thu, 4 Sep 2008 19:13:25 -0400 > "Adam Kraut" > wrote: > > > > I've never used the genbank format, but in Bioruby you > could try: > > include Bio > > fasta = > Alignment::MultiFastaFormat.new(File.open('my.fasta').read) > fasta.entries.each do |seq| > puts seq.to_seq.output(:genbank) > end > > > > No need to use Bio::Alignment::MultiFastaFormat in this case. > Bio::FlatFile alone can do. > > For example, to read from stdin and output to stdout, > > require 'bio' > Bio::FlatFile.open($<) do |ff| > ff.each do |e| > print e.to_biosequence.output(:genbank) > end > end > > Note that the output(:genbank) are new feature only in > the latest development version in the git repository. > http://github.com/bioruby/bioruby > (i.e. in BioRuby 1.2.1, above examples cannot be run.) > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp > / ng at bioruby.org > > > > > The only tricky part is perhaps is the to_seq call for a > Bio::Sequence > object which has different output format methods. 
> -Adam > > On Thu, Sep 4, 2008 at 10:53 AM, Sharvari Gujja > >wrote: > > > > Hi, > > I am trying to convert a multi fasta sequence file > (nucleotide/protein) to > genbank format.Is there a way to do this using Bioruby? > > Appreciate any input/suggestions. > > Thanks > S > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > > From raoul.bonnal at itb.cnr.it Fri Sep 5 07:42:30 2008 From: raoul.bonnal at itb.cnr.it (Raoul Jean Pierre Bonnal) Date: Fri, 05 Sep 2008 09:42:30 +0200 Subject: [BioRuby] problem while handling large fasta files In-Reply-To: <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> References: <2C4D44A5-A886-4913-8651-997F30E758F0@ifom-ieo-campus.it> <20080903133429.01BE51CBC5B7@idnmail.gen-info.osaka-u.ac.jp> <81F4A400-CA95-466C-8814-4B0D90C44076@ifom-ieo-campus.it> <20080904035257.5B6EB1CBC5D9@idnmail.gen-info.osaka-u.ac.jp> <6A0A4704-AA15-4D8F-99EB-465E6341C4B2@ifom-ieo-campus.it> <26079.84.59.22.23.1220529739.squirrel@webmail.science.uva.nl> <20080904130200.C30ED1CBC5DC@idnmail.gen-info.osaka-u.ac.jp> <27147.84.59.22.23.1220535147.squirrel@webmail.science.uva.nl> <20080905014722.C0BFE1CBC5FE@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <1220600550.7632.3.camel@454-2> Il giorno ven, 05/09/2008 alle 10.47 +0900, Naohisa GOTO ha scritto: > On Thu, 4 Sep 2008 15:32:27 +0200 (CEST) > "K. Patil" wrote: > > > Oops, sorry for incomplete information. Here it is; > > > > Ruby: 1.8 > > Bioruby: 1.0.0 > > OS/CPU: 2.6.24.2.1.amd64-smp #1 SMP Mon Feb 11 12:43:21 UTC 2008 x86_64 > > GNU/Linux > > The BioRuby 1.0.0 is too old! 
And use the latest Ruby release; I had some problems handling huge data with 1.8.6.

--
Ra

From tomoakin at kenroku.kanazawa-u.ac.jp Fri Sep 5 06:43:05 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Fri, 5 Sep 2008 15:43:05 +0900
Subject: [BioRuby] GFF attributes
Message-ID:

Hi,

When extracting attributes from a GFF file, the older implementation seems to eat the last character before ";". The current one (downloaded very recently from github) does not split correctly, because the regular expression takes the longest possible match.

A patch is included, but I am not sure about the specification.
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
The specification says:
> From version 2 onwards, the attribute field must have an tag value
> structure following the syntax used within objects in a .ace file,
> flattened onto one line by semicolon separators. Tags must be
> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
> be quoted with double quotes. Note: all non-printing characters in
> such free text value strings (e.g. newlines, tabs, control
> characters, etc) must be explicitly represented by their C (UNIX)
> style backslash-escaped representation (e.g. newlines as '\n', tabs
> as '\t').

So it seems that, for proper parsing, double-quoted free text has to be recognized: a semicolon inside such a quotation is not a separator for attributes, and a semicolon may not be preceded by a backslash.

Anyway, the file I am looking at now is not that complex, and I will go with a quick hack this time.
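
The quoting rule described above can be sketched on its own, outside BioRuby. The following is only an illustration, not the patch in this mail and not BioRuby's parser; the method names split_gff2_attributes and parse_gff2_attributes are invented here. It assumes GFF2-style attributes in which free text is double-quoted and a backslash escapes the next character inside the quotes.

```ruby
require 'strscan'

# Sketch only: split a GFF2 attribute column on semicolons that are
# outside double quotes. Returns an Array of "tag value" strings.
def split_gff2_attributes(column)
  scanner = StringScanner.new(column)
  fields = []
  current = String.new
  until scanner.eos?
    if (quoted = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      current << quoted           # a whole quoted string, semicolons included
    elsif scanner.scan(/;/)
      fields << current.strip     # an unquoted semicolon ends the attribute
      current = String.new
    else
      current << scanner.getch    # any other single character
    end
  end
  fields << current.strip unless current.strip.empty?
  fields
end

# Turn the column into a Hash of tag => raw value (quotes are kept).
def parse_gff2_attributes(column)
  split_gff2_attributes(column).each_with_object({}) do |field, hash|
    next if field.empty?
    key, value = field.split(' ', 2)
    hash[key] = value
  end
end

p parse_gff2_attributes('name "grail3.0116000101"; proteinId 639579; exonNumber 3')
```

Matching the whole quoted string in one step is what keeps a semicolon inside the quotes from being treated as a separator; an unterminated quote simply falls through to the character-by-character branch.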
Best regards,

Tomoaki

the test program

$ cat test-gff.rb
#!/usr/local/bin/ruby
require 'bio'
gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n"
Bio::GFF.new(gff_str).records.each do |fr|
  p fr
end

output after patch

$ /usr/local/bin/ruby test-gff.rb
#<... @comments=nil, @strand="-", @feature="CDS", @score=".", @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"", "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

output from current

#<... @comments=nil, @strand="-", @feature="CDS", @score=".", @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I">

older output

#<... @frame="0", @start="11052", @comments=nil, @strand="-", @feature="CDS", @score=".", @source="JGI", @attributes={"name"=>"\"grail3.0116000101", "proteinId"=>"63957", "exonNumber"=>"3"}>

diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/db/gff.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900
@@ -122,7 +122,7 @@
       def parse_attributes(attributes)
         hash = Hash.new
         scanner = StringScanner.new(attributes)
-        while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/)
+        while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/(.+)/)
           key, value = scanner[1].split(' ', 2)
           key.strip!
           value.strip! if value

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From davide.rambaldi at ifom-ieo-campus.it Mon Sep 8 12:31:45 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Mon, 8 Sep 2008 14:31:45 +0200
Subject: [BioRuby] blatanalyzer.rb
Message-ID: <606898F4-9458-422A-9E42-ECC852BD7967@ifom-ieo-campus.it>

Hi,

I have published a first version of my command-line "application" that uses BioRuby: blatanalyzer, at http://rubyforge.org/projects/blatanalyzer/.

Blatanalyzer is a tool to analyze the alignment output of blat (PSL files):

- list query names; sort by identity, coverage, score, span
- convert to: GFF, GTF formats
- generate: report tables, PSL, GFF and GTF files

Available actions: gff, list, cut, duplicates, gtf, report, singletons, table, summary

gff, gtf: conversion to GFF, GTF
list: generate a list of query names
cut: extract PSL alignments over/under a given threshold (identity, span, coverage, score)
report, table, summary: generate pretty reports. table is like the web-blat output table, report is a custom table with coverage and span, and summary prints a list of query names with the number of alignments and the number of distinct target chromosomes.

More actions are coming...

Basically it is composed of:

- an OptionParser class
- a Bio::Blat::Application class (implements the actions)
- a Bio::Blat::Analyzer class (subclass of Bio::Blat::Report)

Any suggestions are really appreciated!

Davide Rambaldi,
Bioinformatics PhD student.
-----------------------------------------------------
Bioinformatic Group IFOM-IEO Campus
Via Adamello 16, Milano
I-20139 Italy

[t] +39 02574303 066
[e] davide.rambaldi at ifom-ieo-campus.it
[i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage)
[i] http://www.semm.it (PhD school)
[i] http://www.btbs.unimib.it/ (Master)
-----------------------------------------------------

From pjotr2008 at thebird.nl Tue Sep 9 11:38:16 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Tue, 9 Sep 2008 13:38:16 +0200
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080902091958.GA31400@thebird.nl>
References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl>
Message-ID: <20080909113816.GA10051@thebird.nl>

I wrote a simple file Cache Singleton.
See: http://github.com/pjotrp/bioruby/tree/462614487767568f41db03d894875a3d78ced08e/lib/bio/db/microarray/cache.rb The Cache can be read and set with: dir = Bio::Microarray::Cache.instance.directory('GEO') # override cache dir dir = Bio::Microarray::Cache.instance.set(newcachedir,'GEO') Everyone OK with this? Pj. On Tue, Sep 02, 2008 at 11:19:58AM +0200, Pjotr Prins wrote: > > Note that some classes use Tempfile class, a standard bundled > > class with Ruby by default, and the Tempfile class depends > > on enviroment variables (TMPDIR, TMP, etc.). > > I noticed. Caching is a bit different in nature - as caches may be > there for a long time. TMPDIRs get emptied on reboot, for one. > > > I think cache isn't suitable for standard, because its purpose > > may differ from program (or class, module, etc.) to program. > > > For example, if I want to put class A's cache on a fast hard disk > > with very large size, and program B's cache on a slower hard disk > > with small size, what should I do? > > That is true. OK, leave caching for the modules to resolve. I'll use > my own caching of GEO XML objects. From ngoto at gen-info.osaka-u.ac.jp Tue Sep 9 11:47:46 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 9 Sep 2008 20:47:46 +0900 Subject: [BioRuby] GFF attributes In-Reply-To: References: Message-ID: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 5 Sep 2008 15:43:05 +0900 Tomoaki NISHIYAMA wrote: > Hi, > > When extracting attributes from a GFF file, > older implementation seem to have eat the last character before ";". > Current, (downloaded very recently from github), does not split well, > as the regular expression search the largest match. Thank you for reporting a bug. > A patch is included, but I am not sure on the specification. 
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml > The specification says: > > From version 2 onwards, the attribute field must have an tag value > > structure following the syntax used within objects in a .ace file, > > flattened onto one line by semicolon separators. Tags must be > > standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must > > be quoted with double quotes. Note: all non-printing characters in > > such free text value strings (e.g. newlines, tabs, control > > characters, etc) must be explicitly represented by their C (UNIX) > > style backslash-escaped representation (e.g. newlines as '\n', tabs > > as '\t'). I also see BioPerl's _from_gff2_string in Bio::Tools::GFF http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/GFF.html#CODE10 It seems is still has bugs (as described in comments in their code), but semicolons inside double quotes are treated as normal letters and not separators for attributes. > So, it seems that for proper parsing, quotation with double quote > should be checked for free text, > and semicolon in that quatation is not a separator > for attributes and semicolon may not be preceeded with back slash. I've changed to do so. This means the patch was not used. http://github.com/ngoto/bioruby/commit/e38fd48aaf41f94eaec39a639a7f6c5db62c22e8 (This is my repository. Because the change seems severe, I'll push to the main bioruby repository later, after checking more and more.) To prevent repeating the bug, I want to use the GFF string described in your mail for the test script in BioRuby. (test/unit/bio/db/test_gff.rb) Can you give permission? Best regards, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > Anyway, the file I am looking now is not that complex, > and I will go with a quick hack at this time. 
> > Best regards, > > Tomoaki > > the test program > $ cat test-gff.rb > #!/usr/local/bin/ruby > require 'bio' > gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname > \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n" > Bio::GFF.new(gff_str).records.each do |fr| > p fr > end > > output after patch > $ /usr/local/bin/ruby test-gff.rb > # @comments=nil, @strand="-", @feature="CDS", @score=".", > @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"", > "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", > @seqname="LG_I"> > > output from current > # @comments=nil, @strand="-", @feature="CDS", @score=".", > @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId > 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> > > older output > # @frame="0", @start="11052", @comments=nil, @strand="-", > @feature="CDS", @score=".", @source="JGI", @attributes= > {"name"=>"\"grail3.0116000101", "proteinId"=>"63957", > "exonNumber"=>"3"}> > > diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ > bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb > --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ > db/gff.rb 2008-09-03 22:24:39.000000000 +0900 > +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900 > @@ -122,7 +122,7 @@ > def parse_attributes(attributes) > hash = Hash.new > scanner = StringScanner.new(attributes) > - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/) > + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/ > (.+)/) > key, value = scanner[1].split(' ', 2) > key.strip! > value.strip! 
if value
>
> --
> Tomoaki NISHIYAMA
>
> Advanced Science Research Center,
> Kanazawa University,
> 13-1 Takara-machi,
> Kanazawa, 920-0934, Japan
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 10 01:48:20 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 10 Sep 2008 10:48:20 +0900
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080909113816.GA10051@thebird.nl>
References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl>
Message-ID: <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I think the most important thing for a cache is data integrity: for example, when to detect updates of the original data, how to control access, and how to resolve race conditions (two or more processes or threads simultaneously want to use, update, create, and/or remove the same cache data). However, your code only determines the directory name.

line 24:
> def set directory, subdir = nil

In def lines, please use parentheses explicitly, e.g. def set(directory, subdir = nil), because most of the existing code in BioRuby does so.

line 28:
> dir = dir + '/' + subdir

File.join(dir, subdir) should be used, possibly to support non-UNIX systems like Windows.

lines 41 to 45:
> if cache==nil or cache==''
>   cache = ENV['TMPDIR']
> end
> cache = '/tmp' if cache==nil or cache==''
> set cache, subdir

Using Dir.tmpdir, defined in tmpdir.rb, is better.
http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 9 Sep 2008 13:38:16 +0200
pjotr2008 at thebird.nl (Pjotr Prins) wrote:

> I wrote a simple file Cache Singleton.
See: > > http://github.com/pjotrp/bioruby/tree/462614487767568f41db03d894875a3d78ced08e/lib/bio/db/microarray/cache.rb > > The Cache can be read and set with: > > dir = Bio::Microarray::Cache.instance.directory('GEO') > # override cache dir > dir = Bio::Microarray::Cache.instance.set(newcachedir,'GEO') > > Everyone OK with this? > > Pj. > > On Tue, Sep 02, 2008 at 11:19:58AM +0200, Pjotr Prins wrote: > > > Note that some classes use Tempfile class, a standard bundled > > > class with Ruby by default, and the Tempfile class depends > > > on enviroment variables (TMPDIR, TMP, etc.). > > > > I noticed. Caching is a bit different in nature - as caches may be > > there for a long time. TMPDIRs get emptied on reboot, for one. > > > > > I think cache isn't suitable for standard, because its purpose > > > may differ from program (or class, module, etc.) to program. > > > > > For example, if I want to put class A's cache on a fast hard disk > > > with very large size, and program B's cache on a slower hard disk > > > with small size, what should I do? > > > > That is true. OK, leave caching for the modules to resolve. I'll use > > my own caching of GEO XML objects. From pjotr2008 at thebird.nl Wed Sep 10 07:48:58 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 10 Sep 2008 09:48:58 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080910074858.GA16861@thebird.nl> Hi Naohisa, Thanks for comments. See below. On Wed, Sep 10, 2008 at 10:48:20AM +0900, Naohisa GOTO wrote: > Hi, > > I think the most important thing for cache is data integrity. 
> For example, timing for detecting updates of original data, > controlling accesses and resolving race conditions > (two or more processes or threads simultaneously want to > use, update, create, and/or remove the same cache data). > However, your code only contains directory name determination. Well, caching is a universal term for storing stuff intermediately. And what I need is a place to put files. With regard to race conditions you are right - if two processes were to download the same file it would get mangled. However, them being XML the program would throw an error on parsing. For me that works well enough. For BioRuby we may need to think of something more universal - and it is not that hard to do. That is why I wrote my earlier mail. If you want to support something universal it should be at a higher point in the source tree. But maybe leave it until someone gets an itch to scratch. > line 24: > > def set directory, subdir = nil > > In def lines, please use parentheses explicitly, > e.g. def set(directory, subdir = nil), > because most of existing code in BioRuby does so. I like the 'most'. But OK. > line 28: > > dir = dir + '/' + subdir > > File.join(dir, subdir) should be used, possibly to support > non-UNIX systems like Windows. OK > lines 41 to 45: > > if cache==nil or cache=='' > > cache = ENV['TMPDIR'] > > end > > cache = '/tmp' if cache==nil or cache=='' > > set cache, subdir > > Using Dir.tmpdir defined in tempdir.rb is better. > http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html Thanks, Pj. 
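
The three review points in this exchange (parentheses in def lines,
File.join instead of string concatenation, and Dir.tmpdir instead of
reading ENV['TMPDIR'] by hand) can be sketched together as below. This is
only a minimal illustration: cache_directory is a hypothetical helper
name, not the Bio::Microarray::Cache code discussed in the thread, and
'/var/cache/bioruby' is an arbitrary example path.

```ruby
require 'tmpdir'

# Hypothetical helper (an assumption for illustration, not BioRuby API).
# Resolves a cache directory from an optional base directory and an
# optional subdirectory.
def cache_directory(base = nil, subdir = nil)
  # Dir.tmpdir consults the usual environment variables itself and
  # falls back to a system default, so no manual ENV lookup is needed.
  base = Dir.tmpdir if base.nil? || base.empty?
  if subdir.nil? || subdir.empty?
    base
  else
    # File.join supplies the separator instead of hand-built strings.
    File.join(base, subdir)
  end
end

puts cache_directory(nil, 'GEO')
puts cache_directory('/var/cache/bioruby', 'GEO')
```

Note that this still only determines a directory name; it does nothing
about the integrity and race-condition concerns raised above.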
From pjotr2008 at thebird.nl Wed Sep 10 10:27:10 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 10 Sep 2008 12:27:10 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080910074858.GA16861@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> Message-ID: <20080910102710.GA18178@thebird.nl> I have made available for testing Bio::Microarray support for Affy and GEO XML and MINiML formats. The next step will be support for RMA and quantile normalisation. See: http://github.com/pjotrp/bioruby/tree/master http://github.com/pjotrp/bioruby/tree/master/lib/bio/db/microarray git://github.com/pjotrp/bioruby.git Enjoy, Pj. From pjotr2008 at thebird.nl Wed Sep 10 10:36:45 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 10 Sep 2008 12:36:45 +0200 Subject: [BioRuby] Introducing microarray support in BioRuby Message-ID: <20080910103645.GA18598@thebird.nl> Sorry, should have used a different Subject. On Wed, Sep 10, 2008 at 12:27:10PM +0200, Pjotr Prins wrote: > I have made available for testing Bio::Microarray support for Affy and > GEO XML and MINiML formats. The next step will be support for RMA and > quantile normalisation. See: > > http://github.com/pjotrp/bioruby/tree/master > > http://github.com/pjotrp/bioruby/tree/master/lib/bio/db/microarray > > git://github.com/pjotrp/bioruby.git > > Enjoy, > > Pj. 
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Sep 11 01:51:43 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 11 Sep 2008 10:51:43 +0900
Subject: [BioRuby] Translate ambiguous sequence
Message-ID: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>

Hi,

BioRuby's translate turns any codon containing an ambiguity code
into the unknown character, "X".
However, it is sometimes desirable to translate such a codon into a
specific amino acid when that is possible, e.g.

tty -> "F"

The core of the implementation is

naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}

so changing unknown to ct.translate_ambiguity(codon, unknown)
will not hurt performance for sequences without ambiguity, and
trying to resolve degenerate codons is worth doing.
Also, sequences in GenBank are usually translated this way.

What do you think?

diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-c/lib/bio/data/codontable.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/data/codontable.rb 2008-09-11 09:49:23.000000000 +0900
@@ -93,6 +93,23 @@
   def [](codon)
     @table[codon]
   end
+  def translate_ambiguity(codon, unknown = 'X')
+    triplet = codon + "NNN"
+    aa = nil
+    Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third|
+      Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first|
+        Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second|
+          if aa == nil
+            aa = @table[first+second+third]
+          elsif
+            aa != @table[first+second+third]
+            return unknown
+          end
+        end
+      end
+    end
+    aa
+  end
 
   # Modify the codon table. Use with caution as it may break hard coded
   # tables.
If you want to modify existing table, you should use copy
diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-c/lib/bio/data/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/data/na.rb 2008-09-11 09:26:00.000000000 +0900
@@ -182,6 +182,13 @@
     end
     Regexp.new(str)
   end
+  def ambiguity2individual(na, rna = false)
+    str = NAMES[na.downcase].gsub(/[\[\]]/,"")
+    if rna
+      str.tr!("t", "u")
+    end
+    str.split(//)
+  end
 
 end
 
diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-c/lib/bio/sequence/na.rb
--- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900
+++ bioruby-c/lib/bio/sequence/na.rb 2008-09-11 09:48:52.000000000 +0900
@@ -252,7 +252,7 @@
     end
     nalen = naseq.length - from
     nalen -= nalen % 3
-    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
+    aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)}
     return Bio::Sequence::AA.new(aaseq)
   end

-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp Thu Sep 11 02:34:36 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 11 Sep 2008 11:34:36 +0900
Subject: [BioRuby] GFF attributes
In-Reply-To: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp>
References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi

> To prevent repeating the bug, I want to use the GFF string
> described in your mail for the test script in BioRuby.
> (test/unit/bio/db/test_gff.rb)
> Can you give permission?

Certainly, I have no objection.
The string is one of the lines in the Poplar genome annotation from the
JGI site.
ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
So, I think acknowledging them is a good idea.

As a test string, I think another pattern with multiple values for
one key is worth adding.
The example from
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003

Perhaps the current implementation will return
'"HBA_HUMAN" 11 55' as the value for 'Target'.
But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more sensible,
or closer to the meaning of the specification.
Since changing this return value would introduce incompatibilities,
I'm not sure whether it can be changed.
But if it is ever to be changed, it is better changed early, or at
least announced as a planned change.
If it is too late, perhaps we can add a method under a different name
so that currently working code will not be affected.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2008/09/09, at 20:47, Naohisa GOTO wrote:

> Hi,
>
> On Fri, 5 Sep 2008 15:43:05 +0900
> Tomoaki NISHIYAMA wrote:
>
>> Hi,
>>
>> When extracting attributes from a GFF file,
>> older implementation seem to have eat the last character before ";".
>> Current, (downloaded very recently from github), does not split well,
>> as the regular expression search the largest match.
>
> Thank you for reporting a bug.
>
>> A patch is included, but I am not sure on the specification.
>> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
>> The specification says:
>>> From version 2 onwards, the attribute field must have an tag value
>>> structure following the syntax used within objects in a .ace file,
>>> flattened onto one line by semicolon separators. Tags must be
>>> standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must
>>> be quoted with double quotes. Note: all non-printing characters in
>>> such free text value strings (e.g.
newlines, tabs, control >>> characters, etc) must be explicitly represented by their C (UNIX) >>> style backslash-escaped representation (e.g. newlines as '\n', tabs >>> as '\t'). > > I also see BioPerl's _from_gff2_string in Bio::Tools::GFF > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/ > Tools/GFF.html#CODE10 > It seems is still has bugs (as described in comments in their code), > but semicolons inside double quotes are treated as normal letters > and not separators for attributes. > >> So, it seems that for proper parsing, quotation with double quote >> should be checked for free text, >> and semicolon in that quatation is not a separator >> for attributes and semicolon may not be preceeded with back slash. > > I've changed to do so. This means the patch was not used. > > http://github.com/ngoto/bioruby/commit/ > e38fd48aaf41f94eaec39a639a7f6c5db62c22e8 > (This is my repository. Because the change seems severe, > I'll push to the main bioruby repository later, > after checking more and more.) > > To prevent repeating the bug, I want to use the GFF string > described in your mail for the test script in BioRuby. > (test/unit/bio/db/test_gff.rb) > Can you give permission? > > Best regards, > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > >> >> Anyway, the file I am looking now is not that complex, >> and I will go with a quick hack at this time. 
>> >> Best regards, >> >> Tomoaki >> >> the test program >> $ cat test-gff.rb >> #!/usr/local/bin/ruby >> require 'bio' >> gff_str = "LG_I\tJGI\tCDS\t11052\t11064\t.\t-\t0\tname >> \"grail3.0116000101\"; proteinId 639579; exonNumber 3\n" >> Bio::GFF.new(gff_str).records.each do |fr| >> p fr >> end >> >> output after patch >> $ /usr/local/bin/ruby test-gff.rb >> #> @comments=nil, @strand="-", @feature="CDS", @score=".", >> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"", >> "proteinId"=>"639579", "exonNumber"=>"3"}, @end="11064", >> @seqname="LG_I"> >> >> output from current >> #> @comments=nil, @strand="-", @feature="CDS", @score=".", >> @source="JGI", @attributes={"name"=>"\"grail3.0116000101\"; proteinId >> 639579", "exonNumber"=>"3"}, @end="11064", @seqname="LG_I"> >> >> older output >> #> @frame="0", @start="11052", @comments=nil, @strand="-", >> @feature="CDS", @score=".", @source="JGI", @attributes= >> {"name"=>"\"grail3.0116000101", "proteinId"=>"63957", >> "exonNumber"=>"3"}> >> >> diff -ur bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/ >> lib/ >> bio/db/gff.rb bioruby-a/lib/bio/db/gff.rb >> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ >> db/gff.rb 2008-09-03 22:24:39.000000000 +0900 >> +++ bioruby-a/lib/bio/db/gff.rb 2008-09-05 14:56:50.000000000 +0900 >> @@ -122,7 +122,7 @@ >> def parse_attributes(attributes) >> hash = Hash.new >> scanner = StringScanner.new(attributes) >> - while scanner.scan(/(.*[^\\])\;/) or scanner.scan(/(.+)/) >> + while scanner.scan(/(([^;]|\\;)*[^\\])\;/) or scanner.scan(/ >> (.+)/) >> key, value = scanner[1].split(' ', 2) >> key.strip! >> value.strip! 
if value
>>
>>
>> --
>> Tomoaki NISHIYAMA
>>
>> Advanced Science Research Center,
>> Kanazawa University,
>> 13-1 Takara-machi,
>> Kanazawa, 920-0934, Japan
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From tomoakin at kenroku.kanazawa-u.ac.jp Mon Sep 15 10:08:56 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 15 Sep 2008 19:08:56 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi,

To make translation more compatible with what is done between DNA
entries and protein entries in the databases, I think special treatment
of start codons and of incomplete codons is necessary.

Special treatment of start codons is needed for codons that are
translated to M only when used as the start codon, and to a different
amino acid when used as an internal codon within a CDS.
For example, GUG is V if it is internal to the CDS, but it can also
serve as a start codon, and in that case it encodes M.
To change the behavior, I think an option is required.

Incomplete codons are seen at the end of incomplete CDSs, presumably
due to the cloning or sequencing strategy.
When there is 'cg' at the end of a CDS, it is translated to 'R',
since any nucleotide in the third position would make the codon
translate as 'R'.

It seems the translation is added only if the amino acid can be
specified and is not 'X'.
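
The behavior described above (M for a leading start codon, resolution of
degenerate codons, and a trailing two-base codon such as 'cg' giving 'R')
can be illustrated with a small standalone sketch. It does not use
BioRuby and is not the patch below: TABLE, IUPAC, resolve and translate
are names made up for this example, the table is the standard genetic
code written out from its usual compact string, and the start codon list
passed in the usage lines is an assumption chosen to match the GUG
example.

```ruby
# Standard genetic code, ordered by first, second, third base in t, c, a, g order.
BASES = %w[t c a g]
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

TABLE = {}
BASES.each_with_index do |b1, i|
  BASES.each_with_index do |b2, j|
    BASES.each_with_index do |b3, k|
      TABLE[b1 + b2 + b3] = AMINO[16 * i + 4 * j + k, 1]
    end
  end
end

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't',
  'r' => 'ag', 'y' => 'ct', 'm' => 'ac', 'k' => 'gt', 's' => 'cg', 'w' => 'at',
  'h' => 'act', 'b' => 'cgt', 'v' => 'acg', 'd' => 'agt', 'n' => 'acgt'
}

# Returns the amino acid shared by every expansion of the codon,
# or the unknown character if the expansions disagree.
# Codons shorter than three bases are padded with 'n'.
def resolve(codon, unknown = 'X')
  padded = (codon + 'nn')[0, 3]
  sets = padded.chars.map { |c| IUPAC.fetch(c, 'acgt').chars }
  aas = sets[0].product(sets[1], sets[2]).map { |t| TABLE[t.join] }.uniq
  aas.size == 1 ? aas.first : unknown
end

# Translates frame 1 of seq. start_codons is supplied by the caller;
# a leading codon found in that list is translated as 'M'.
def translate(seq, start_codons = [], unknown = 'X')
  codons = seq.downcase.tr('u', 't').scan(/.{1,3}/)
  residues = []
  codons.each_with_index do |codon, i|
    if i.zero? && start_codons.include?(codon)
      residues << 'M'
    else
      # Plain table lookup first, so unambiguous codons stay cheap.
      residues << (TABLE[codon] || resolve(codon, unknown))
    end
  end
  aaseq = residues.join
  # An incomplete final codon is reported only if it could be resolved.
  aaseq = aaseq.chomp(unknown) if seq.length % 3 != 0
  aaseq
end

puts translate('gtgcgn')                    # VR
puts translate('gtgcg', %w[atg gtg ttg])    # MR
puts translate('ttyga')                     # F
```

As in the patch, the ambiguity branch is reached only when the plain
table lookup fails, so sequences without ambiguity codes pay nothing
extra.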
-- Tomoaki NISHIYAMA Advanced Science Research Center, Kanazawa University, 13-1 Takara-machi, Kanazawa, 920-0934, Japan diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13 12:06:28.000000000 +0900 @@ -93,6 +93,23 @@ def [](codon) @table[codon] end + def translate_ambiguity(codon, unknown = 'X') + triplet = codon + "NNN" + aa = nil + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third| + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do| first| + Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do| second| + if aa == nil + aa = @table[first+second+third] + elsif + aa != @table[first+second+third] + return unknown + end + end + end + end + aa + end # Modify the codon table. Use with caution as it may break hard coded # tables. 
If you want to modify existing table, you should use copy diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/data/na.rb bioruby-a/lib/bio/data/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ data/na.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/data/na.rb 2008-09-13 12:06:28.000000000 +0900 @@ -182,6 +182,13 @@ end Regexp.new(str) end + def ambiguity2individual(na, rna = false) + str = NAMES[na.downcase].gsub(/[\[\]]/,"") + if rna + str.tr!("t", "u") + end + str.split(//) + end end diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ sequence/na.rb 2008-09-03 22:24:39.000000000 +0900 +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15 18:57:19.000000000 +0900 @@ -231,7 +231,7 @@ # (default 1) # * (optional) _unknown_: Character (default 'X') # *Returns*:: Bio::Sequence::AA object - def translate(frame = 1, table = 1, unknown = 'X') + def translate(frame = 1, table = 1, unknown = 'X', check_start = false) if table.is_a?(Bio::CodonTable) ct = table else @@ -251,8 +251,19 @@ from = 0 end nalen = naseq.length - from - nalen -= nalen % 3 - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} +# nalen -= nalen % 3 + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) + if nalen > 3 + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} + else + aaseq = "M" + end + else + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} + end + if nalen % 3 != 0 + aaseq.sub!(/X$/,"") + end return Bio::Sequence::AA.new(aaseq) end From ktym at hgc.jp Mon Sep 15 12:12:52 2008 From: ktym at hgc.jp (Toshiaki Katayama) Date: Mon, 15 Sep 2008 21:12:52 +0900 Subject: [BioRuby] Translate ambiguous sequence In-Reply-To: References: 
<9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
Message-ID: <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>

Hi,

* check_start

As you suggested, the codon table object (Bio::CodonTable) holds a list of
start codons, but the Bio::Sequence::NA#translate method does not
use it (the same is true for the stop codons).

lib/bio/data/codontable.rb:
------------------------------------------------------------
  # Create your own codon table by giving a Hash table of codons and relevant
  # amino acids. You can also able to define the table's name as a second
  # argument.
  #
  # Two Arrays 'start' and 'stop' can be specified which contains a list of
  # start and stop codons used by 'start_codon?' and 'stop_codon?' methods.
  def initialize(hash, definition = nil, start = [], stop = [])
    @table = hash
    @definition = definition
    @start = start
    @stop = stop.empty? ? generate_stop : stop
  end
------------------------------------------------------------

So the following part of your code should be included in some way
(though I would prefer check_start = true by default, and an explicit
'first_codon' variable instead of naseq[0, 3]).

------------------------------------------------------------
+ if check_start and from == 0 and ct.start_codon?(naseq[0, 3])
------------------------------------------------------------


* ambiguity

As for ambiguity, your needs seem to be restricted to the 3' end of
the sequence, but there may also be demand for translating 'n's
within the sequence.

Since Bio::Sequence::NA#translate accepts your own codon table object
as the 2nd argument, and you can copy and override the default codon
tables (#1 to #23; or define your own codon table from scratch),
another approach is to define the ambiguous translations yourself.

------------------------------------------------------------
your_codon_table = Bio::CodonTable.copy(1)
your_codon_table['cgn'] = 'R'
your_codon_table['cg'] = 'R'

aaseq = naseq.translate(frame, your_codon_table)
------------------------------------------------------------

To do this, we only need to change the following lines

lib/bio/sequence/na.rb (translate):
------------------------------------------------------------
nalen -= nalen % 3
aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

to the below

------------------------------------------------------------
#nalen -= nalen % 3
aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or unknown}
------------------------------------------------------------

but maybe with a toggle flag to enable/disable this feature.

Regards,
Toshiaki Katayama

> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioruby

From tomoakin at kenroku.kanazawa-u.ac.jp Tue Sep 16 03:15:19 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 16 Sep 2008 12:15:19 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
Message-ID: <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>

Hi,

Thank you for the comments.

> (but I prefer to set check_start = true by default;

The default was set to false just to avoid changing the existing
behavior; making it true is fine with me.
If changing the interface is allowed, I would prefer that unknown
become a later argument, since changing the unknown character from 'X'
is expected to be very rare and, in fact, can be done with a simple
gsub without any help from the library.

> As for the ambiguity, your needs seems to be restricted
> only for the 3' end of the sequence, but there may be demands
> for translating 'n's in the sequence.

My need is not restricted to the 3' end, nor to 'N's: there are ten
other IUPAC ambiguity codes as well.
My message of September 11 dealt only with those situations (where a
whole triplet is given but contains an ambiguity code); it did not
consider the start codon or the translation of two bases at the 3' end.

I agree that adding to the codon tables every degenerate codon that
still determines a single amino acid is another way to resolve the
ambiguity codes.
But a table supporting all possible combinations would be quite large
for every codon table (at least for human review), and a generator
would have to be written.
Expecting that sequences containing ambiguity are rare, I wrote the
code so that it does not affect the efficiency of translating
sequences without ambiguity.
The code for ambiguity is clearly quite expensive, but I do not expect
to translate sequences containing so many ambiguity codes that this
becomes a problem.
(A high proportion of ambiguity in itself is fine if the sequence is
not very long.)
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From ktym at hgc.jp Tue Sep 16 04:56:14 2008
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue, 16 Sep 2008 13:56:14 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To: <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp>
 <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp>
 <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
Message-ID:

Hi,

> It was set to false for the default for just not to
> change the default behavior and is ok to make true for me.

I just thought that if the main application of the 'translate' method
is to translate a gene into a protein sequence, the current
implementation is incomplete and should be changed.
If not, retaining the current behavior may be better.

> If the change of the interface is allowed,
> I prefer that the unknown be later option, since
> changing the unknown from 'X' is expected to be very rare,
> and, in fact, it can be done just a gsub operation without
> the help of the library.

I can agree (I don't know what others think, though).
Another option is to provide different methods (interfaces)
for handling start/stop codons and ambiguous bases.
Or to introduce named options...

> My need is not restricted to the 3' end, and also not restricted to
> 'N's but there are ten other IUPAC redundant codes.

Sorry, I misunderstood your code. You are trying to translate
all possible combinations of the ambiguous bases on the fly.

Your code is fine, and the following is just for discussion:

Is there no efficient way to statically generate a reduction of
the given codon table considering ambiguous bases...?
Your implementation seems to return 'unknown' if the translation of the codon containing ambiguous bases are translated to the different amino acid, however, the comparison occurs every time when the codon is passed to the 'translate_ambiguity' method. It would be helpful to know how many patterns needed to be generated to match codons with ambiguous bases for 20 amino acids. Is it possible to rewrite current Bio::CodonTable implementation to utilize Regexp as a key for the codon table hash for this purpose? Regards, Toshiaki Katayama On 2008/09/16, at 12:15, Tomoaki NISHIYAMA wrote: > Hi, > > Thank you for comments. > > (but I prefer to set check_start = true by default; > It was set to false for the default for just not to > change the default behavior and is ok to make true for me. > If the change of the interface is allowed, > I prefer that the unknown be later option, since > changing the unknown from 'X' is expected to be very rare, > and, in fact, it can be done just a gsub operation without > the help of the library. > >> As for the ambiguity, your needs seems to be restricted >> only for the 3' end of the sequence, but there may be demands >> for translating 'n's in the sequence. > > > My need is not restricted to the 3' end, and also not restricted to > 'N's but there are ten other IUPAC redundant codes. > The message on September 11 treated only on these situations > (where whole triplet is given but contain an ambiguity code) > but not conscious on the start and the 3' end translation of 2 base. > > I agree that addition of all possible redundant determinate codes to the codon tables > is another way to resolve the ambiguity codes. > But the table will be quite large to support all the possible > combinations for all the tables (at least for human review), > and a generator should be written. > Expecting that sequences containing ambiguity is rare, I wrote the code that will > not impact the efficiency of translating sequence without ambiguity. 
> Apparently the code for ambiguity is quite expensive, but I do not expect translating > sequences containing so many ambiguity code that is problematic. > (High proportion of ambiguity in itself is ok if the sequence is not very long). > -- > Tomoaki NISHIYAMA > > Advanced Science Research Center, > Kanazawa University, > 13-1 Takara-machi, > Kanazawa, 920-0934, Japan > > > On 2008/09/15, at 21:12, Toshiaki Katayama wrote: > >> Hi, >> >> * check_start >> >> As you suggested, the codon table object (Bio::CodonTable) holds a list of >> start codons as a knowledge, but Bio::Sequence::NA#translate method does not >> utilize it (it is also true for the stop codons). >> >> lib/bio/data/codontable.rb: >> ------------------------------------------------------------ >> # Create your own codon table by giving a Hash table of codons and relevant >> # amino acids. You can also able to define the table's name as a second >> # argument. >> # >> # Two Arrays 'start' and 'stop' can be specified which contains a list of >> # start and stop codons used by 'start_codon?' and 'stop_codon?' methods. >> def initialize(hash, definition = nil, start = [], stop = []) >> @table = hash >> @definition = definition >> @start = start >> @stop = stop.empty? ? generate_stop : stop >> end >> ------------------------------------------------------------ >> >> So, the following your code should be included in someway >> (but I prefer to set check_start = true by default; and >> use 'first_codon' variable explicitly instead of naseq[0, 3]). >> >> ------------------------------------------------------------ >> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >> ------------------------------------------------------------ >> >> >> * ambiguity >> >> As for the ambiguity, your needs seems to be restricted >> only for the 3' end of the sequence, but there may be demands >> for translating 'n's in the sequence. 
>> >> As the Bio::Sequence::NA#translate accepts the codon table object >> of your own as the 2nd argument, and you can copy and override >> the default codon tables (#1 to #23; or you can define your own >> codon table from scratch), there may be another approach to define >> ambiguous translations by your own. >> >> ------------------------------------------------------------ >> your_codon_table = Bio::CodonTable.copy(1) >> your_codon_table['cgn'] = 'R' >> your_codon_table['cg'] = 'R' >> >> aaseq = naseq.translate(frame, your_codon_table) >> ------------------------------------------------------------ >> >> To do this, we only need to change the following lines >> >> lib/bio/sequence/na.rb (translate): >> ------------------------------------------------------------ >> nalen -= nalen % 3 >> aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} >> ------------------------------------------------------------ >> >> to the below >> >> ------------------------------------------------------------ >> #nalen -= nalen % 3 >> aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or unknown} >> ------------------------------------------------------------ >> >> but may be with a toggle flag to enable/disable this feature. >> >> Regards, >> Toshiaki Katayama >> >> >> >> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote: >> >>> Hi, >>> >>> To further make translation compatible what is done between DNA entry and protein >>> entry in databases, I thought that special treatment of the start codon and >>> incomplete codons are necessary. >>> >>> Special treatment of the start codons are for those codons that is >>> translated to M only when it is used as the start codon and >>> a different amino acids if it is used as an internal codon within a CDS. >>> For example GUG is V if it is internal to the CDS, but it can also serve >>> as a start codon and in that case it encodes M. >>> To change the behavior, I think an option is required. 
>>> >>> Incomplete codons are seen at the end of incomplete CDS, presumably due to >>> cloning or sequencing strategy. >>> When there are 'cg' at the end of CDS that are translated to 'R' >>> as any nucleotide would make the codon translate as 'R' >>> >>> It seems the translation are added only if the amino acid can be specified and is not 'X'. >>> -- >>> Tomoaki NISHIYAMA >>> >>> Advanced Science Research Center, >>> Kanazawa University, >>> 13-1 Takara-machi, >>> Kanazawa, 920-0934, Japan >>> >>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb bioruby-a/lib/bio/data/codontable.rb >>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 >>> +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13 12:06:28.000000000 +0900 >>> @@ -93,6 +93,23 @@ >>> def [](codon) >>> @table[codon] >>> end >>> + def translate_ambiguity(codon, unknown = 'X') >>> + triplet = codon + "NNN" >>> + aa = nil >>> + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each do|third| >>> + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each do|first| >>> + Bio::NucleicAcid.ambiguity2individual(triplet[1..1]).each do|second| >>> + if aa == nil >>> + aa = @table[first+second+third] >>> + elsif >>> + aa != @table[first+second+third] >>> + return unknown >>> + end >>> + end >>> + end >>> + end >>> + aa >>> + end >>> >>> # Modify the codon table. Use with caution as it may break hard coded >>> # tables. 
If you want to modify existing table, you should use copy >>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb bioruby-a/lib/bio/data/na.rb >>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900 >>> +++ bioruby-a/lib/bio/data/na.rb 2008-09-13 12:06:28.000000000 +0900 >>> @@ -182,6 +182,13 @@ >>> end >>> Regexp.new(str) >>> end >>> + def ambiguity2individual(na, rna = false) >>> + str = NAMES[na.downcase].gsub(/[\[\]]/,"") >>> + if rna >>> + str.tr!("t", "u") >>> + end >>> + str.split(//) >>> + end >>> >>> end >>> >>> diff -ru bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb bioruby-a/lib/bio/sequence/na.rb >>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900 >>> +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15 18:57:19.000000000 +0900 >>> @@ -231,7 +231,7 @@ >>> # (default 1) >>> # * (optional) _unknown_: Character (default 'X') >>> # *Returns*:: Bio::Sequence::AA object >>> - def translate(frame = 1, table = 1, unknown = 'X') >>> + def translate(frame = 1, table = 1, unknown = 'X', check_start = false) >>> if table.is_a?(Bio::CodonTable) >>> ct = table >>> else >>> @@ -251,8 +251,19 @@ >>> from = 0 >>> end >>> nalen = naseq.length - from >>> - nalen -= nalen % 3 >>> - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or unknown} >>> +# nalen -= nalen % 3 >>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >>> + if nalen > 3 >>> + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} >>> + else >>> + aaseq = "M" >>> + end >>> + else >>> + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} >>> + end >>> + if nalen % 3 != 0 >>> + aaseq.sub!(/X$/,"") >>> + end >>> return Bio::Sequence::AA.new(aaseq) >>> end >>> >>> >>> 
_______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > From ngoto at gen-info.osaka-u.ac.jp Tue Sep 16 05:12:31 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 16 Sep 2008 14:12:31 +0900 Subject: [BioRuby] Translate ambiguous sequence In-Reply-To: References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp> <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp> <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp> Message-ID: <20080916051231.E52721CBC4F5@idnmail.gen-info.osaka-u.ac.jp> On Tue, 16 Sep 2008 13:56:14 +0900 Toshiaki Katayama wrote: > Hi, > > > It was set to false for the default for just not to > > change the default behavior and is ok to make true for me. > > I just thought that if the main application of the 'translate' > method is to translate gene to protein sequence, current > implementation is incomplete and should be changed. > If not, retain the current behavior may be better. I'm using the "translate" not only for whole genes, but also for partial sequences and/or sequences with unknown start positions. So, I don't want to change the default. 
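
A minimal sketch of this concern, using a toy codon table and a hypothetical toy_translate helper (neither is BioRuby's actual code, and the start codon list is only an assumption for illustration). When the input is an internal fragment of a CDS, forcing the first codon to "M" would misreport a valine:

```ruby
# Toy codon table (lowercase DNA); only the codons used below are listed.
CODONS = { 'gtg' => 'V', 'atg' => 'M', 'aaa' => 'K', 'tgg' => 'W' }.freeze

# Hypothetical start codon list, for illustration only.
START_CODONS = %w[atg gtg ttg].freeze

# Hypothetical helper, not Bio::Sequence::NA#translate.
# Translates complete triplets; with check_start, a listed start codon
# in the first position is reported as "M".
def toy_translate(seq, check_start: false)
  codons = seq.downcase.scan(/.{3}/)
  aa = codons.map { |codon| CODONS.fetch(codon, 'X') }
  aa[0] = 'M' if check_start && !codons.empty? && START_CODONS.include?(codons.first)
  aa.join
end

fragment = 'gtgaaatgg' # internal fragment of a CDS, not its true 5' end
puts toy_translate(fragment)                    # internal GTG stays V
puts toy_translate(fragment, check_start: true) # first codon forced to M
```

With check_start off by default, callers translating partial sequences keep the current behavior, and only callers who know the sequence begins at the true start would opt in.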
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Tue Sep 16 06:38:37 2008
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Tue, 16 Sep 2008 15:38:37 +0900
Subject: [BioRuby] Translate ambiguous sequence
In-Reply-To:
References: <9E7111E0-F2DC-485D-B829-C7BD116517CD@kenroku.kanazawa-u.ac.jp> <8D8CEB76-06C4-4474-95B2-B0BD9AD38E03@hgc.jp> <3136C698-6CD5-4749-90BF-514F4E7AFDB7@kenroku.kanazawa-u.ac.jp>
Message-ID: <5712116A-2F5D-460D-8557-896A83B2861E@kenroku.kanazawa-u.ac.jp>

Hi,

> Is there no efficient way to statically generate a reduction of
> the given codon table considering ambiguous bases...?
>
> Your implementation seems to return 'unknown' if the translation of
> the codon containing ambiguous bases are translated to the different
> amino acid, however, the comparison occurs every time when the codon
> is passed to the 'translate_ambiguity' method.
>
> It would be helpful to know how many patterns needed to be generated
> to match codons with ambiguous bases for 20 amino acids.

Generating the hash itself is not very difficult (just iterate over all
possible triplets and dinucleotides, with some assumptions about the table),
and 174-195 keys are sufficient for each of the preexisting codon tables
(for 20 amino acids plus '*').

The benefit is usually quite low, as there is little ambiguity in DNA
sequences (because low-quality regions are deleted in an earlier step).
The hash might be worth including for the standard codon tables when someone
needs to directly process a large quantity of poor-quality sequence data
(maybe 454 or Solexa?).
For codon table objects that are copied and modified, I expect there are
few cases where the cost of generating that table for ambiguity treatment
is smaller than the cost of the on-the-fly comparison.

#!/usr/local/bin/ruby
require 'bio'

dnanucleotides = ['a', 'c', 'g', 't', 'y', 'r', 'w', 's', 'k', 'm', 'b', 'd', 'h', 'v', 'n']
tableary=Array.new
[1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15, 16, 21, 22, 23].each do |tableno|
  partialhash = Hash.new
  dnanucleotides.each do |first|
    dnanucleotides.each do |second|
      dnaseq = Bio::Sequence::NA.new(first + second)
      transl = dnaseq.translate(1,tableno)
      if transl != 'X' and transl != ""
        partialhash[dnaseq] = transl
      end
      dnanucleotides.each do |third|
        dnaseq = Bio::Sequence::NA.new(first + second + third)
        transl = dnaseq.translate(1,tableno)
        if transl != 'X'
          partialhash[dnaseq] = transl
        end
      end
    end
  end
  puts "table#{tableno}: #{partialhash.size} patterns"
  # p partialhash
end

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2008/09/16, at 13:56, Toshiaki Katayama wrote:

> Hi,
>
>> It was set to false for the default for just not to
>> change the default behavior and is ok to make true for me.
>
> I just thought that if the main application of the 'translate'
> method is to translate gene to protein sequence, current
> implementation is incomplete and should be changed.
> If not, retain the current behavior may be better.
>
>> If the change of the interface is allowed,
>> I prefer that the unknown be later option, since
>> changing the unknown from 'X' is expected to be very rare,
>> and, in fact, it can be done just a gsub operation without
>> the help of the library.
>
> I can agree (don't know how others think, though).
> Another option is to provide different methods (interfaces)
> for considering start/stop codons and ambiguous bases.
> Or introduce named options...
>
>> My need is not restricted to the 3' end, and also not restricted to
>> 'N's but there are ten other IUPAC redundant codes.
>
> Sorry, I misunderstood your code.
>
> You are trying to translate all possible combinations of the ambiguous
> bases on the fly.
> > Your code is fine and followings are just for discussion: > > Is there no efficient way to statically generate a reduction of > the given codon table considering ambiguous bases...? > > Your implementation seems to return 'unknown' if the translation of > the codon containing ambiguous bases are translated to the different > amino acid, however, the comparison occurs every time when the codon > is passed to the 'translate_ambiguity' method. > > It would be helpful to know how many patterns needed to be generated > to match codons with ambiguous bases for 20 amino acids. > > Is it possible to rewrite current Bio::CodonTable implementation > to utilize Regexp as a key for the codon table hash for this purpose? > > Regards, > Toshiaki Katayama > > > On 2008/09/16, at 12:15, Tomoaki NISHIYAMA wrote: > >> Hi, >> >> Thank you for comments. >>> (but I prefer to set check_start = true by default; >> It was set to false for the default for just not to >> change the default behavior and is ok to make true for me. >> If the change of the interface is allowed, >> I prefer that the unknown be later option, since >> changing the unknown from 'X' is expected to be very rare, >> and, in fact, it can be done just a gsub operation without >> the help of the library. >> >>> As for the ambiguity, your needs seems to be restricted >>> only for the 3' end of the sequence, but there may be demands >>> for translating 'n's in the sequence. >> >> >> My need is not restricted to the 3' end, and also not restricted to >> 'N's but there are ten other IUPAC redundant codes. >> The message on September 11 treated only on these situations >> (where whole triplet is given but contain an ambiguity code) >> but not conscious on the start and the 3' end translation of 2 base. >> >> I agree that addition of all possible redundant determinate codes >> to the codon tables >> is another way to resolve the ambiguity codes. 
>> But the table will be quite large to support all the possible >> combinations for all the tables (at least for human review), >> and a generator should be written. >> Expecting that sequences containing ambiguity is rare, I wrote the >> code that will >> not impact the efficiency of translating sequence without ambiguity. >> Apparently the code for ambiguity is quite expensive, but I do not >> expect translating >> sequences containing so many ambiguity code that is problematic. >> (High proportion of ambiguity in itself is ok if the sequence is >> not very long). >> -- >> Tomoaki NISHIYAMA >> >> Advanced Science Research Center, >> Kanazawa University, >> 13-1 Takara-machi, >> Kanazawa, 920-0934, Japan >> >> >> On 2008/09/15, at 21:12, Toshiaki Katayama wrote: >> >>> Hi, >>> >>> * check_start >>> >>> As you suggested, the codon table object (Bio::CodonTable) holds >>> a list of >>> start codons as a knowledge, but Bio::Sequence::NA#translate >>> method does not >>> utilize it (it is also true for the stop codons). >>> >>> lib/bio/data/codontable.rb: >>> ------------------------------------------------------------ >>> # Create your own codon table by giving a Hash table of codons >>> and relevant >>> # amino acids. You can also able to define the table's name as >>> a second >>> # argument. >>> # >>> # Two Arrays 'start' and 'stop' can be specified which contains >>> a list of >>> # start and stop codons used by 'start_codon?' and 'stop_codon?' >>> methods. >>> def initialize(hash, definition = nil, start = [], stop = []) >>> @table = hash >>> @definition = definition >>> @start = start >>> @stop = stop.empty? ? generate_stop : stop >>> end >>> ------------------------------------------------------------ >>> >>> So, the following your code should be included in someway >>> (but I prefer to set check_start = true by default; and >>> use 'first_codon' variable explicitly instead of naseq[0, 3]). 
>>> >>> ------------------------------------------------------------ >>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >>> ------------------------------------------------------------ >>> >>> >>> * ambiguity >>> >>> As for the ambiguity, your needs seems to be restricted >>> only for the 3' end of the sequence, but there may be demands >>> for translating 'n's in the sequence. >>> >>> As the Bio::Sequence::NA#translate accepts the codon table object >>> of your own as the 2nd argument, and you can copy and override >>> the default codon tables (#1 to #23; or you can define your own >>> codon table from scratch), there may be another approach to define >>> ambiguous translations by your own. >>> >>> ------------------------------------------------------------ >>> your_codon_table = Bio::CodonTable.copy(1) >>> your_codon_table['cgn'] = 'R' >>> your_codon_table['cg'] = 'R' >>> >>> aaseq = naseq.translate(frame, your_codon_table) >>> ------------------------------------------------------------ >>> >>> To do this, we only need to change the following lines >>> >>> lib/bio/sequence/na.rb (translate): >>> ------------------------------------------------------------ >>> nalen -= nalen % 3 >>> aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] or >>> unknown} >>> ------------------------------------------------------------ >>> >>> to the below >>> >>> ------------------------------------------------------------ >>> #nalen -= nalen % 3 >>> aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct[codon] or >>> unknown} >>> ------------------------------------------------------------ >>> >>> but may be with a toggle flag to enable/disable this feature. 
>>> >>> Regards, >>> Toshiaki Katayama >>> >>> >>> >>> On 2008/09/15, at 19:08, Tomoaki NISHIYAMA wrote: >>> >>>> Hi, >>>> >>>> To further make translation compatible what is done between DNA >>>> entry and protein >>>> entry in databases, I thought that special treatment of the >>>> start codon and >>>> incomplete codons are necessary. >>>> >>>> Special treatment of the start codons are for those codons that is >>>> translated to M only when it is used as the start codon and >>>> a different amino acids if it is used as an internal codon >>>> within a CDS. >>>> For example GUG is V if it is internal to the CDS, but it can >>>> also serve >>>> as a start codon and in that case it encodes M. >>>> To change the behavior, I think an option is required. >>>> >>>> Incomplete codons are seen at the end of incomplete CDS, >>>> presumably due to >>>> cloning or sequencing strategy. >>>> When there are 'cg' at the end of CDS that are translated to 'R' >>>> as any nucleotide would make the codon translate as 'R' >>>> >>>> It seems the translation are added only if the amino acid can be >>>> specified and is not 'X'. 
>>>> -- >>>> Tomoaki NISHIYAMA >>>> >>>> Advanced Science Research Center, >>>> Kanazawa University, >>>> 13-1 Takara-machi, >>>> Kanazawa, 920-0934, Japan >>>> >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/ >>>> codontable.rb bioruby-a/lib/bio/data/codontable.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/data/codontable.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/data/codontable.rb 2008-09-13 >>>> 12:06:28.000000000 +0900 >>>> @@ -93,6 +93,23 @@ >>>> def [](codon) >>>> @table[codon] >>>> end >>>> + def translate_ambiguity(codon, unknown = 'X') >>>> + triplet = codon + "NNN" >>>> + aa = nil >>>> + Bio::NucleicAcid.ambiguity2individual(triplet[2..2]).each >>>> do|third| >>>> + Bio::NucleicAcid.ambiguity2individual(triplet[0..0]).each >>>> do|first| >>>> + Bio::NucleicAcid.ambiguity2individual(triplet >>>> [1..1]).each do|second| >>>> + if aa == nil >>>> + aa = @table[first+second+third] >>>> + elsif >>>> + aa != @table[first+second+third] >>>> + return unknown >>>> + end >>>> + end >>>> + end >>>> + end >>>> + aa >>>> + end >>>> >>>> # Modify the codon table. Use with caution as it may break >>>> hard coded >>>> # tables. 
If you want to modify existing table, you should use >>>> copy >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/data/ >>>> na.rb bioruby-a/lib/bio/data/na.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/data/na.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/data/na.rb 2008-09-13 >>>> 12:06:28.000000000 +0900 >>>> @@ -182,6 +182,13 @@ >>>> end >>>> Regexp.new(str) >>>> end >>>> + def ambiguity2individual(na, rna = false) >>>> + str = NAMES[na.downcase].gsub(/[\[\]]/,"") >>>> + if rna >>>> + str.tr!("t", "u") >>>> + end >>>> + str.split(//) >>>> + end >>>> >>>> end >>>> >>>> diff -ru bioruby- >>>> bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/bio/ >>>> sequence/na.rb bioruby-a/lib/bio/sequence/na.rb >>>> --- bioruby-bioruby-1440b766202a2b66ac7386b9b46928834a9c9873/lib/ >>>> bio/sequence/na.rb 2008-09-03 22:24:39.000000000 +0900 >>>> +++ bioruby-a/lib/bio/sequence/na.rb 2008-09-15 >>>> 18:57:19.000000000 +0900 >>>> @@ -231,7 +231,7 @@ >>>> # (default 1) >>>> # * (optional) _unknown_: Character (default 'X') >>>> # *Returns*:: Bio::Sequence::AA object >>>> - def translate(frame = 1, table = 1, unknown = 'X') >>>> + def translate(frame = 1, table = 1, unknown = 'X', >>>> check_start = false) >>>> if table.is_a?(Bio::CodonTable) >>>> ct = table >>>> else >>>> @@ -251,8 +251,19 @@ >>>> from = 0 >>>> end >>>> nalen = naseq.length - from >>>> - nalen -= nalen % 3 >>>> - aaseq = naseq[from, nalen].gsub(/.{3}/) {|codon| ct[codon] >>>> or unknown} >>>> +# nalen -= nalen % 3 >>>> + if check_start and from == 0 and ct.start_codon?(naseq[0, 3]) >>>> + if nalen > 3 >>>> + aaseq = "M" + naseq[from+3, nalen-3].gsub(/.{1,3}/) {| >>>> codon| ct[codon] or ct.translate_ambiguity(codon, unknown)} >>>> + else >>>> + aaseq = "M" >>>> + end >>>> + else >>>> + aaseq = naseq[from, nalen].gsub(/.{1,3}/) {|codon| ct >>>> [codon] or ct.translate_ambiguity(codon, unknown)} >>>> + end >>>> + if nalen % 
3 != 0
>>>> + aaseq.sub!(/X$/,"")
>>>> + end
>>>> return Bio::Sequence::AA.new(aaseq)
>>>> end

From sgujja at broad.mit.edu Tue Sep 16 19:34:11 2008
From: sgujja at broad.mit.edu (Sharvari Gujja)
Date: Tue, 16 Sep 2008 15:34:11 -0400
Subject: [BioRuby] Bio::Blast::RPSBlast::Report
Message-ID: <48D00A33.2050906@broad.mit.edu>

Hi,

Can someone please direct me to Bio::Blast::RPSBlast::Report
documentation/examples?

Thanks
S

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 17 02:44:28 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 17 Sep 2008 11:44:28 +0900
Subject: [BioRuby] Bio::Blast::RPSBlast::Report
In-Reply-To: <48D00A33.2050906@broad.mit.edu>
References: <48D00A33.2050906@broad.mit.edu>
Message-ID: <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp>

On Tue, 16 Sep 2008 15:34:11 -0400
Sharvari Gujja wrote:

> Hi,
>
> Can someone please direct me to Bio::Blast::RPSBlast::Report
> documentation/examples ?

http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html

Note that Bio::Blast::RPSBlast::Report still exists only in the
development version, and its spec and usage may be changed before
the release version in the near future.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 17 03:56:19 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 17 Sep 2008 12:56:19 +0900
Subject: [BioRuby] GFF attributes
In-Reply-To:
References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Thu, 11 Sep 2008 11:34:36 +0900
Tomoaki NISHIYAMA wrote:

> Hi
>
> > To prevent repeating the bug, I want to use the GFF string
> > described in your mail for the test script in BioRuby.
> > (test/unit/bio/db/test_gff.rb)
> > Can you give permission?
>
> Surely, I have no objection.
> The string is one of the lines in the Poplar genome annotation from
> the JGI site.
> ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/Poptr1_1.JamboreeModels.gff.gz
> So, I think acknowledging them is a good idea.

Thank you. I'll add the above URL in the comments of the test.

> For test string, I think another pattern including multiple values for
> one key is worth adding.
> The example from http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml:
> seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
>
> Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the
> value for 'Target'.
> But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more
> sensible, or represent
> more of the meaning of the specification.

In this case, string escaping and quotation in free text can also be
processed by the class, and [ 'HBA_HUMAN', '11', '55' ] can be returned.

> Since changing this return value will make incompatibilities, I'm not
> sure
> whether it can be changed.
> But if it is ever to be changed, it is better changed early, or
> stated as such.
> If it is too late, perhaps we can make a method under a different
> name so that
> currently working code will not be affected.


Indeed, for GFF2 attributes, I've already found a
design problem in the current Bio::GFF::GFF2#attributes.
Currently, a hash is used to store attributes, but
the GFF2 spec allows two or more tags with the same name.

For example,
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature
  Align 101 11 ; Align 179 36 ;

In this case, with the current bioruby implementation, the
"Align 101 11" is overwritten by the latter "Align 179 36",
and we can only get { "Align" => "179 36" }.

To solve the problem, I can think of the following two ways.

1. Using an Array to store values from multiple tags.

   For example, in the above case,
     @attributes = {}
     @attributes['Align'] = [ '101 11', '179 36' ]
     @attributes['Target'] = '"HBA_HUMAN" 11 54'

   I already took this approach in GFF3 with incompatible
   changes, because the previous implementation of
   GFF3#attributes was broken and could not be used.
   But now I think this approach is not good and
   I want to change it, because every caller has to check
   whether the value is an array or not.

   In addition, in this case, we cannot parse
   '"HBA_HUMAN" 11 54' to [ 'HBA_HUMAN', '11', '54' ],
   because it is impossible to distinguish values from
   multiple tags from parsed values, unless an array is
   always used.

2. Giving up using a hash, and using an array (or possibly
   a new class e.g. GFF2::Attributes) of [ tag, value ]
   pairs.

   For backward compatibility, a hash can be dynamically
   generated when the old behavior is requested.

   I think this approach is better.
   I'll implement this later.

Any comments and suggestions are welcome.
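
Approach 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual BioRuby code: the class name AttributePairs and its methods are hypothetical, and values are kept as plain strings without any parsing.

```ruby
# Hypothetical sketch of approach 2: attributes are kept as an ordered list
# of [tag, value] pairs, so repeated tags such as "Align" are not overwritten.
# AttributePairs is an illustrative name, not a BioRuby class.
class AttributePairs
  def initialize
    @pairs = []
  end

  # Append a pair; existing pairs with the same tag are kept.
  def add(tag, value)
    @pairs << [tag, value]
    self
  end

  # All values stored under the tag, in input order (always an Array).
  def values(tag)
    @pairs.select { |t, _| t == tag }.map { |_, v| v }
  end

  # Old-style Hash view, built on demand. As with the old behavior,
  # a later value replaces an earlier one that has the same tag.
  def to_hash
    h = {}
    @pairs.each { |t, v| h[t] = v }
    h
  end
end

attrs = AttributePairs.new
attrs.add('Align', '101 11').add('Align', '179 36')
attrs.add('Target', '"HBA_HUMAN" 11 54')

p attrs.values('Align')
p attrs.to_hash['Align']
```

Because values(tag) always returns an Array, callers do not need to check whether a value is an Array or a String, which was the drawback of approach 1.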
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From sgujja at broad.mit.edu Wed Sep 17 14:18:58 2008 From: sgujja at broad.mit.edu (Sharvari Gujja) Date: Wed, 17 Sep 2008 10:18:58 -0400 Subject: [BioRuby] Bio::Blast::RPSBlast::Report In-Reply-To: <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp> References: <48D00A33.2050906@broad.mit.edu> <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <48D111D2.8010404@broad.mit.edu> Hi, Thank you so much for the info. However, on running the code for rpsblast output parser, I get the following error: *uninitialized constant Bio::Blast::RPSBlast (NameError)* I am not sure what exactly I am missing here. I really appreciate all the help. Thanks S Naohisa GOTO wrote: > On Tue, 16 Sep 2008 15:34:11 -0400 > Sharvari Gujja wrote: > > >> Hi, >> >> Can someone please direct me to Bio::Blast::RPSBlast::Report >> documentation/examples ? >> > > http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html > > Note that the Bio::Blast::RPSBlast::Report exists still > only in development version, and the spec and usage > would be changed before the release version in near future. 
> > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Thu Sep 18 03:16:59 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 18 Sep 2008 12:16:59 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080910074858.GA16861@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> Message-ID: <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, If you don't want to implement any access control, using world writable directory like /tmp (comes from ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, because this is vulnerable to a symbolic link attack. About symbolic link attack, please refer documents: http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx (Note that Ruby's standard TempFile has no problem.) When the "cache" directory isn't explicitly specified by user by using the environment variable BIORUBY_CACHE (or command-line options of custom application), doing without cache should be the default. It is also good to raise SecurityError when the specified directory is writable by everyone. On Wed, 10 Sep 2008 09:48:58 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > Thanks for comments. See below. > > On Wed, Sep 10, 2008 at 10:48:20AM +0900, Naohisa GOTO wrote: > > Hi, > > > > I think the most important thing for cache is data integrity. 
> > For example, timing for detecting updates of original data,
> > controlling accesses and resolving race conditions
> > (two or more processes or threads simultaneously want to
> > use, update, create, and/or remove the same cache data).
> > However, your code only contains directory name determination.
>
> Well, caching is a universal term for storing stuff intermediately.
> And what I need is a place to put files. With regard to race
> conditions you are right - if two processes were to download the same
> file it would get mangled. However, them being XML the program would
> throw an error on parsing. For me that works well enough. For BioRuby
> we may need to think of something more universal - and it is not that
> hard to do. That is why I wrote my earlier mail. If you want to
> support something universal it should be at a higher point in the
> source tree.
>
> But maybe leave it until someone gets an itch to scratch.

If the mangled XML happened to be syntactically valid XML,
there would be no obvious error but incorrect data could be obtained.
However, for now, I believe "that works well enough".
Please document in RDoc the limitation of the current
implementation under race conditions.

> > line 24:
> > > def set directory, subdir = nil
> >
> > In def lines, please use parentheses explicitly,
> > e.g. def set(directory, subdir = nil),
> > because most of existing code in BioRuby does so.
>
> I like the 'most'. But OK.
>
> > line 28:
> > > dir = dir + '/' + subdir
> >
> > File.join(dir, subdir) should be used, possibly to support
> > non-UNIX systems like Windows.
>
> OK
>
> > lines 41 to 45:
> > > if cache==nil or cache==''
> > > cache = ENV['TMPDIR']
> > > end
> > > cache = '/tmp' if cache==nil or cache==''
> > > set cache, subdir
> >
> > Using Dir.tmpdir defined in tmpdir.rb is better.
> > http://www.ruby-doc.org/stdlib/libdoc/tmpdir/rdoc/index.html
>
> Thanks,
>
> Pj.
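
The default suggested in this thread (no cache unless a directory is given explicitly, and refusing a world-writable one) might look roughly like the sketch below. The method name cache_directory and its env parameter are hypothetical and not part of BioRuby; BIORUBY_CACHE is the environment variable proposed in the discussion.

```ruby
# Sketch only: returns the cache directory, or nil when caching should be
# disabled. cache_directory is a hypothetical helper, not a BioRuby method.
def cache_directory(subdir = nil, env = ENV)
  base = env['BIORUBY_CACHE']
  # No explicit setting: run without a cache.
  return nil if base.nil? || base.empty?

  # Refuse a directory that anyone can write to (permission bit 0002).
  if (File.stat(base).mode & 0002) != 0
    raise SecurityError, "cache directory #{base} is world writable"
  end

  # File.join keeps the path portable to non-UNIX systems.
  subdir ? File.join(base, subdir) : base
end
```

With this shape, a caller that receives nil simply fetches the data without caching, and each module can pass its own subdir (for example 'GEO') to keep its files apart.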

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr2008 at thebird.nl Thu Sep 18 06:32:37 2008
From: pjotr2008 at thebird.nl (Pjotr Prins)
Date: Thu, 18 Sep 2008 08:32:37 +0200
Subject: [BioRuby] RFC Caching (was BioRuby standards)
In-Reply-To: <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp>
References: <20080902065055.GA29634@thebird.nl>
 <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp>
 <20080902091958.GA31400@thebird.nl>
 <20080909113816.GA10051@thebird.nl>
 <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp>
 <20080910074858.GA16861@thebird.nl>
 <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20080918063237.GA17631@thebird.nl>

Hi Naohisa,

On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote:
> Hi Pjotr,
>
> If you don't want to implement any access control,
> using world writable directory like /tmp (comes from
> ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled,
> because this is vulnerable to a symbolic link attack.
>
> About symbolic link attack, please refer documents:
> http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx
> (Note that Ruby's standard TempFile has no problem.)

I agree - assuming you are running a webservice for microarrays.

> When the "cache" directory isn't explicitly specified
> by user by using the environment variable BIORUBY_CACHE
> (or command-line options of custom application),
> doing without cache should be the default.

NCBI won't be happy with that. But if that is what Bioruby wants...
It is not only about my own bandwidth ;-).

> It is also good to raise SecurityError when the specified
> directory is writable by everyone.

I'll remove tmpdir - I introduced it because of an earlier mail.

Disabling the cache is easy - of course. Another option is to use
Tempfiles and keep track of those in a Hash (I'd rather not have large
IO objects in memory).
OK, that is what I'll implement - assuming you want to include the microarray stuff in Bioruby. Pj. From davide.rambaldi at ifom-ieo-campus.it Fri Sep 19 12:49:40 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Fri, 19 Sep 2008 14:49:40 +0200 Subject: [BioRuby] MacRuby Message-ID: May be you already know: MacRuby 0.3 Released with Interface Builder Support By joel at Wed, Sep 17 2008 11:02am |News ? Ruby Inside reports that the nascent MacRuby distribution, an implementation of Ruby 1.9 based on Mac OS X core technologies, has been updated to version 0.3. The most exciting change in this update is the support for Interface Builder and all the Xcode+IB goodness you need to build gorgeous, GUI-based scientific apps for OS X using the ever productive and succinct Ruby language. Also noteworthy is the inclusion of the HotCocoa library, which is somewhat of a domain specific language for working with Cocoa classes from Ruby. Hopefully a number MacRuby + BioRuby mashups will follow on the heels of this exciting development. best regards Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From pjotr2008 at thebird.nl Fri Sep 19 14:05:14 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Fri, 19 Sep 2008 16:05:14 +0200 Subject: [BioRuby] RFC Unit testing large files Message-ID: <20080919140514.GA32740@thebird.nl> For microarray unit tests I have some 30Mb of files. Probably not very nice to put those in the source tree. The options are: 1. Host them in the source tree - huge downloads for everyone. 2. 
Fetch them on demand by the unit tests - takes long time the first time and where do I put them? In a cache directory? 3. Have the unit tests in a separate tree - special purpose testing 4. No unit tests for these I have the same unit tests in the biolib tree - but that is a hassle too. For BioRuby I propose (3). Maybe I ought to solely use the biolib tree for these specific unit tests and have a 'stub' in the Bioruby tree for them. This problem will come back - and keep in mind the free github space is 'only' 100 Mb. Pj. From pjotr2008 at thebird.nl Fri Sep 19 15:29:54 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Fri, 19 Sep 2008 17:29:54 +0200 Subject: [BioRuby] RFC Unit testing large files In-Reply-To: <1221837547.6231.5.camel@454-2> References: <20080919140514.GA32740@thebird.nl> <1221837547.6231.5.camel@454-2> Message-ID: <20080919152954.GA2058@thebird.nl> It is not simply testing some small code. It is to verify, for example, that large files get read properly - and that RMA normalization does its job. Otherwise I would certainly opt for such a solution. Pj. On Fri, Sep 19, 2008 at 05:19:07PM +0200, Raoul Jean Pierre Bonnal wrote: > Il giorno ven, 19/09/2008 alle 16.05 +0200, Pjotr Prins ha scritto: > > For microarray unit tests I have some 30Mb of files. Probably not > > very nice to put those in the source tree. The options are: > > > > 1. Host them in the source tree - huge downloads for everyone. > > > > 2. Fetch them on demand by the unit tests - takes long time the first > > time and where do I put them? In a cache directory? > > > > 3. Have the unit tests in a separate tree - special purpose testing > > > > 4. No unit tests for these > > > > I have the same unit tests in the biolib tree - but that is a hassle > > too. For BioRuby I propose (3). Maybe I ought to solely use the biolib > > tree for these specific unit tests and have a 'stub' in the Bioruby > > tree for them. 
> > > > This problem will come back - and keep in mind the free github space > > is 'only' 100 Mb. > > Create a piece of code to generate fake data for local test? > > -- > Ra From raoul.bonnal at itb.cnr.it Fri Sep 19 15:19:07 2008 From: raoul.bonnal at itb.cnr.it (Raoul Jean Pierre Bonnal) Date: Fri, 19 Sep 2008 17:19:07 +0200 Subject: [BioRuby] RFC Unit testing large files In-Reply-To: <20080919140514.GA32740@thebird.nl> References: <20080919140514.GA32740@thebird.nl> Message-ID: <1221837547.6231.5.camel@454-2> Il giorno ven, 19/09/2008 alle 16.05 +0200, Pjotr Prins ha scritto: > For microarray unit tests I have some 30Mb of files. Probably not > very nice to put those in the source tree. The options are: > > 1. Host them in the source tree - huge downloads for everyone. > > 2. Fetch them on demand by the unit tests - takes long time the first > time and where do I put them? In a cache directory? > > 3. Have the unit tests in a separate tree - special purpose testing > > 4. No unit tests for these > > I have the same unit tests in the biolib tree - but that is a hassle > too. For BioRuby I propose (3). Maybe I ought to solely use the biolib > tree for these specific unit tests and have a 'stub' in the Bioruby > tree for them. > > This problem will come back - and keep in mind the free github space > is 'only' 100 Mb. Create a piece of code to generate fake data for local test? 
-- Ra From pjotr2008 at thebird.nl Tue Sep 23 11:58:52 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Tue, 23 Sep 2008 13:58:52 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080918063237.GA17631@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> Message-ID: <20080923115852.GA6808@thebird.nl> Hi Naohisa, I fixed the Cache to be secure. It will use a safe Tmpdir if no directory is specified and raise SecurityErrors when appropriate. See http://github.com/pjotrp/bioruby/tree/master Pj. On Thu, Sep 18, 2008 at 08:32:37AM +0200, Pjotr Prins wrote: > Hi Naohisa, > > On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > If you don't want to implement any access control, > > using world writable directory like /tmp (comes from > > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, > > because this is vulnerable to a symbolic link attack. > > > > About symbolic link attack, please refer documents: > > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx > > (Note that Ruby's standard TempFile has no problem.) > > I agree - assuming you are running a webservice for microarrays. > > > When the "cache" directory isn't explicitly specified > > by user by using the environment variable BIORUBY_CACHE > > (or command-line options of custom application), > > doing without cache should be the default. > > NCBI won't be happy with that. But if that is what Bioruby wants... > It is not only about my own bandwidth ;-). > > > It is also good to raise SecurityError when the specified > > directory is writable by everyone. 
>
> I'll remove tmpdir - I introduced it because of an earlier mail.
>
> Disabling the cache is easy - off course. Another option is to use
> TmpFiles and keep track of those in a Hash (I'd rather not have large
> IO objects in memory). OK, that is what I'll implement - assuming you
> want to include the microarray stuff in Bioruby.
>
> Pj.
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 07:52:45 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 24 Sep 2008 16:52:45 +0900
Subject: [BioRuby] Bio::Blast::RPSBlast::Report
In-Reply-To: <48D111D2.8010404@broad.mit.edu>
References: <48D00A33.2050906@broad.mit.edu>
 <20080917024429.0B3A91501EF@idnmail.gen-info.osaka-u.ac.jp>
 <48D111D2.8010404@broad.mit.edu>
Message-ID: <20080924075246.24F291CBC49F@idnmail.gen-info.osaka-u.ac.jp>

Bio::Blast::RPSBlast was introduced in April 2008, but
bioruby 1.2.1, the current latest release, was released
in December 2007. This means you need the unreleased development
version of bioruby from github. You can download a snapshot as a tarball
http://github.com/bioruby/bioruby/tarball/master
and install it (or extract it and set the -I option, the RUBYLIB
environment variable, etc.).
An alternative way is to use git (see
http://github.com/bioruby/bioruby/wikis ).

As it is a development version, it is unstable: some features
may not work, and incompatible changes may be made.
Please upgrade to the new release as soon as it is available.

In addition, after commit 11f1787cf93c046c06d4a33a554210d56866274e,
the limitation on multi-fasta reports is eliminated when
used with Bio::FlatFile.

  require 'bio'

  filename = 'test.rpsblast'
  Bio::FlatFile.open(Bio::Blast::RPSBlast::Report, filename) do |ff|
    i = 0
    ff.each do |e|
      i += 1
      print "Query\##{i} = ", e.query_def, "\n"
      j = 0
      e.each do |hit|
        j += 1
        print "Query\##{i}/Hit\##{j} = ", hit.target_def, "\n"
        k = 0
        hit.each do |hsp|
          k += 1
          print "Query\##{i}/Hit\##{j}/Hsp\##{k} = ",
                "value=#{hsp.evalue}, ",
                "Positions #{hsp.query_from}..#{hsp.query_to}:",
                "#{hsp.hit_from}..#{hsp.hit_to}\n"
          print "Query : #{hsp.qseq}\n"
          print "        #{hsp.midline}\n"
          print "Hit   : #{hsp.hseq}\n"
        end
      end
    end
  end

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 17 Sep 2008 10:18:58 -0400
Sharvari Gujja wrote:

> Hi,
>
> Thank you so much for the info. However, on running the code for
> rpsblast output parser, I get the following error:
>
> *uninitialized constant Bio::Blast::RPSBlast (NameError)*
>
> I am not sure what exactly I am missing here.
>
> I really appreciate all the help.
>
> Thanks
> S
>
> Naohisa GOTO wrote:
> > On Tue, 16 Sep 2008 15:34:11 -0400
> > Sharvari Gujja wrote:
> >
> >
> >> Hi,
> >>
> >> Can someone please direct me to Bio::Blast::RPSBlast::Report
> >> documentation/examples ?
> >>
> >
> > http://lists.open-bio.org/pipermail/bioruby/2008-April/000624.html
> >
> > Note that the Bio::Blast::RPSBlast::Report exists still
> > only in development version, and the spec and usage
> > would be changed before the release version in near future.
> > > > Naohisa Goto > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 13:38:19 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 24 Sep 2008 22:38:19 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080923115852.GA6808@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> Message-ID: <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> Hi Pjotr, I've seen files in your lib/bio/db/microarray, and I suppose it's still under development and it will be changed frequently, and I think it's not a time to include them in main bioruby. So, my comments below are mainly for future improvements. 1. about cache.rb The "safe = true" argument in 'set' and 'directory' seems bad idea. I think there is no need to give insecure options to users. In 'directory' method, > cache = Dir.mktmpdir(subdir) The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, and not available in 1.8.6 and older versions. Because most users are still using Ruby 1.8.5 and 1.8.6, to avoid using Dir.mktmpdir is currently a choice. Alternatively, write a document that the feature can work only in Ruby 1.8.7 or later. Note that current requirement of BioRuby is "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". Also note that FileUtils.remove_entry_secure was introduced in Ruby 1.8.3. 

Finally, I'm wondering whether the Cache class can remain
a singleton in the future. Currently, only NCBI_GEO
is using the cache, but if it were used from many classes
with different data formats, files with different formats
would exist in the same cache directory, and file name
conflicts might occur.

2. About file locations

I recommend moving the following to bio/io/,
because their main purpose is file or network I/O,
and not data parsing.
  bio/db/microarray/cache.rb
  Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb
The class/module names do not need to be changed.

The files with an external dependency on "biolib" might
also be moved from bio/db to some other location,
but I have not found a good location for them.

3. Bio::Microarray::NCBI_GEO

In bio/db/microarray/ncbi_geo/geo.rb,

> include REXML

If the aim of including the REXML module is only to skip the
REXML:: prefix, I would rather not include it in a library,
because the constants and methods defined in REXML are
mixed in and they might cause bad side effects.
(Note that unlike in a library, it is free to include
anything in an application.)

> def XML::create(acc)

In my impression, the method name "XML.create" should be
reserved for a method that creates an XML data structure
from scratch or from some data.
To define a class method, I like 'def self.create(acc)'
because it makes it easy to change the class (module) name.

> def XML::fetch(xmlfn, acc)
> url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml"

URI escaping is needed, e.g. acc=#{URI.escape(acc)}

> print "Fetching ",url,"\n" if $VERBOSE
> r = Net::HTTP.get_response( URI.parse( url ) )

To support proxy, use Bio::Command.get_uri(url).

> def XML::valid_accession?(acc = nil)
> acc = @acc if not acc
> acc =~ /^(GSM|GSE|GPL)\d+$/

If "GSM0123\nGSM4567" is invalid, the regular expression
should be /\A(GSM|GSE|GPL)\d+\z/ .

> def XML::parsexml(acc)

Is there no way to get input XML data as String?
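
On the anchoring point for valid_accession? above: in Ruby, ^ and $ match at line boundaries inside a string, while \A and \z match only at the start and end of the whole string. A small check using the two patterns quoted above (the constant names and the matches? helper are only for illustration):

```ruby
# Line-anchored and string-anchored versions of the accession pattern.
LINE_ANCHORED   = /^(GSM|GSE|GPL)\d+$/
STRING_ANCHORED = /\A(GSM|GSE|GPL)\d+\z/

# Illustrative helper: true when the pattern matches the text.
def matches?(pattern, text)
  !(pattern =~ text).nil?
end

multi = "GSM0123\nGSM4567"

puts matches?(LINE_ANCHORED, multi)        # ^ and $ accept the first line
puts matches?(STRING_ANCHORED, multi)      # \A and \z reject the two-line input
puts matches?(STRING_ANCHORED, "GSM0123")  # a single accession is accepted
```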
> if XML::valid_accession? acc > cache = Cache.instance.directory > fn = cache+'/'+acc+'.xml' Please use File.join. Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Tue, 23 Sep 2008 13:58:52 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > I fixed the Cache to be secure. It will use a safe Tmpdir if no > directory is specified and raise SecurityErrors when appropriate. > > See http://github.com/pjotrp/bioruby/tree/master > > Pj. > > On Thu, Sep 18, 2008 at 08:32:37AM +0200, Pjotr Prins wrote: > > Hi Naohisa, > > > > On Thu, Sep 18, 2008 at 12:16:59PM +0900, Naohisa GOTO wrote: > > > Hi Pjotr, > > > > > > If you don't want to implement any access control, > > > using world writable directory like /tmp (comes from > > > ENV['TMPDIR'] or Dir.tmpdir) by default should be disabled, > > > because this is vulnerable to a symbolic link attack. > > > > > > About symbolic link attack, please refer documents: > > > http://www.codeproject.com/KB/web-security/TemporaryFileSecurity.aspx > > > (Note that Ruby's standard TempFile has no problem.) > > > > I agree - assuming you are running a webservice for microarrays. > > > > > When the "cache" directory isn't explicitly specified > > > by user by using the environment variable BIORUBY_CACHE > > > (or command-line options of custom application), > > > doing without cache should be the default. > > > > NCBI won't be happy with that. But if that is what Bioruby wants... > > It is not only about my own bandwidth ;-). > > > > > It is also good to raise SecurityError when the specified > > > directory is writable by everyone. > > > > I'll remove tmpdir - I introduced it because of an earlier mail. > > > > Disabling the cache is easy - off course. Another option is to use > > TmpFiles and keep track of those in a Hash (I'd rather not have large > > IO objects in memory). OK, that is what I'll implement - assuming you > > want to include the microarray stuff in Bioruby. > > > > Pj. 
> > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby From ngoto at gen-info.osaka-u.ac.jp Wed Sep 24 14:05:26 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 24 Sep 2008 23:05:26 +0900 Subject: [BioRuby] GFF attributes In-Reply-To: <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp> References: <20080909114748.555DB1CBC528@idnmail.gen-info.osaka-u.ac.jp> <20080917035620.3EEA2150201@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080924140526.E779C1CBC3C3@idnmail.gen-info.osaka-u.ac.jp> Hi, In my github repository, I've made incompatible changes in Bio::GFF::GFF2 and Bio::GFF::GFF3 classes. Now, attributes are stored as an Array containing [ tag, value ] pairs, for example, [ [ 'Gene', 'CEN1' ], [ 'E_value', '0.0003' ], [ 'Note', 'CEN1; Chromosome I Centromere' ] ]. To get an attribute, it is recommended to use a new method Record#arrtibute(tag) and so on. String escaping in free text is automatically processed. In addition, GFF2 attribute value with multiple tokens e.g. 'Target "HBA_HUMAN" 11 55' are parsed to Bio::GFF::GFF2::Record::Value object. (Note that a value with single token is still a String). To keep backward compatibility, the specification of Bio::GFF is not so changed except for bug fix. To use new feature, Bio::GFF::GFF2 or Bio::GFF::GFF3 should be explicitly used. For more details, please see http://github.com/ngoto/bioruby/commit/95391949d217e6f7c9ee7444afebec6ee8677035 If no problems are found, it will be included in the main bioruby repository. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Wed, 17 Sep 2008 12:56:19 +0900 Naohisa GOTO wrote: > Hi, > > On Thu, 11 Sep 2008 11:34:36 +0900 > Tomoaki NISHIYAMA wrote: > > > Hi > > > > > To prevent repeating the bug, I want to use the GFF string > > > described in your mail for the test script in BioRuby. 
> > > (test/unit/bio/db/test_gff.rb) > > > Can you give permission? > > > > Surely, I have no objection. > > The string is one of the line in the Popular genome annotation from > > the JGI site. > > ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/annotation/v1.1/ > > Poptr1_1.JamboreeModels.gff.gz > > So, I think acknowledging them is a good idea. > > Thank you. I'll add above URL in the comments of the test. > > > For test string, I think another pattern including multiple value for > > one key is worth to add. > > The example from http://www.sanger.ac.uk/Software/formats/GFF/ > > GFF_Spec.shtml: > > seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 > > 55 ; E_value 0.0003 > > > > Perhaps current implementation will return '"HBA_HUMAN" 11 55' as the > > value for 'Target'. > > But returning an Array ['"HBA_HUMAN"', '11', '55'] may be more > > sensible, or represent > > more of the meaning of the specification. > > In this case, string escaping and quotation in free text > can also be processed by the class, and > [ 'HBA_HUMAN', 11', '55'] can be returned. > > > Since changing this return value will make incompatibilities, I'm not > > sure > > whether it can be changed. > > But if it is ever to be changed, it is better changed early, or > > stated as such. > > If it is too late, perhaps we can make a method under a different > > name so that > > currently working code will not be affected. > > Indeed, for GFF2 attributes, I've alrealy found a > design problem in current Bio::GFF::GFF2#attributes. > Currently, a hash is used to store attributes, but > the GFF2 spec allows more than two tags with the same name. > > For example, > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml#homology_feature > Align 101 11 ; Align 179 36 ; > > In this case, with current bioruby implementation, the > "Align 101 11" is overwritten by the latter "Align 179 36", > and we can only get { "Align" => "179 36" }. 
> > To solve the problem, I can think the following two ways. > > 1. Using an Array to store values from multiple tags. > > For example, in the above case, > @attributes = {} > @attribures['Align'] = [ '101 11', '179 36' ] > @attribures['Target'] = '"HBA_HUMAN" 11 54' > > I already took this approach in GFF3 with incompatible > changes, because the previous implementation of > GFF3#attributes was broken and cannot be used. > But now, I just think this approch is not good and > I want to change it now, because checking whether > the value is an array or not is needed every time. > > In addition, in this case, we can not parse > '"HBA_HUMAN" 11 54' to [ 'HBA_HUMAN', 11', '54'], > because it is impossible to distinguish values from > multiple tags or parsed values, unless an array is > always used. > > 2. Giving up using hash, and using an array (or possibly > a new class e.g. GFF2::Attributes) of [ tag, value ] > pairs. > > For backward compatibility, hash can be dynamically > generated when old behavior is requested. > > I think this approach is better. > I'll implement this later. > > Any comments and suggestions are welcome. 
> > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr2008 at thebird.nl Wed Sep 24 16:29:24 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Wed, 24 Sep 2008 18:29:24 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080924162924.GA19778@thebird.nl> Hi Naohisa, On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote: > Hi Pjotr, > > I've seen files in your lib/bio/db/microarray, and I suppose > it's still under development and it will be changed frequently, > and I think it's not a time to include them in main bioruby. > So, my comments below are mainly for future improvements. What there is is 'stable'. Certainly the NCBI stuff is rather complete. The biolib libraries could go in later. It is up to you, but I think it would be nice to have mainstream microarray support before one of the other Bio* libraries (and biolib support is there for all). We don't want to be beaten by BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem available - but that may be confusing for others. Another thing, what is the point of open source software if no one tests it. How about regularly releasing a testing version of bioruby? We see some more activity in BioRuby - which is a good thing. 
You can't expect things to be ready from the word GO! Meanwhile, I do appreciate your comments. It is forcing me to write better code. Teaching an old fox new tricks ;-) > 1. about cache.rb > > The "safe = true" argument in 'set' and 'directory' seems > bad idea. I think there is no need to give insecure options > to users. I'll remove it if you wish. I think it is up to the implementor - if you have a web service you better use the default safe mode. Otherwise, who cares. I, for one, would like to use /tmp in some cases. > In 'directory' method, > > cache = Dir.mktmpdir(subdir) > > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, > and not available in 1.8.6 and older versions. > Because most users are still using Ruby 1.8.5 and 1.8.6, > to avoid using Dir.mktmpdir is currently a choice. > Alternatively, write a document that the feature can work > only in Ruby 1.8.7 or later. Yes we can document that. Using microarray bindings a later Ruby is a good idea anyway. > Note that current requirement of BioRuby is > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". > Also note that FileUtils.remove_entry_secure was introduced > in Ruby 1.8.3. Well, the modules are optionally included. It shouldn't break if people don't use the microarray stuff. This is true for the dependency on external biolib too. > Finally, I'm wondering if the Cache class can still be > a singleton or not in the future. Currently, only NCBI_GEO > is using the cache, but if it were used from many classes > with different data formats, files with different formats > would be existed in the same cache directory, and file name > conflicts might be happened. This implementation is such that we create a shared dir, with classes using different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between modules. My current GEO cache is 30 Mb. If I were to download that every time my research would be severely hampered. 
I think it is very useful and could also be for running webservices of other modules. You don't want web servers to retain everything in memory. > 2. About file locations > > Below are recommended to be moved to bio/io/, > because their main purpose is file or network I/O, > and not data parsing. > bio/db/microarray/cache.rb OK. > Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb It does NCBI XML parsing - but that is not what you mean? > The class/module names are not needed to be changed. > > The files with external dependency to the "biolib" might > also be suggested to be moved from bio/db to the other > location, but no best location found. heh - anyone else a suggestiong? The biolib stuff does do microarray loading and will do normalization and analysis soon. > 3. BIo::Microarray::NCBI_GEO > > In bio/db/microarray/ncbi_geo/geo.rb, > > > include REXML > > If the aim to include REXML module is only to skip the > REXML:: prefix, I don't like to include it in library, > because the constants and methods defined in REXML are > mixed and they might cause bad side effects. > (Note that unlike in a library, it is free to include > anything in an application.) OK > > def XML::create(acc) > > In my impression, the method name "XML.create" might be > reserved to be used by a method to create XML data structure > from scratch or from some data. > To define a class method, I like 'def self.create(acc)' > because it is easy to change class (module) name. It is a class factory. I'll have a think. > > def XML::fetch(xmlfn, acc) > > url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml" > > URI escaping is needed, e.g. acc=#{URI.escape(acc)} > > > print "Fetching ",url,"\n" if $VERBOSE > > r = Net::HTTP.get_response( URI.parse( url ) ) > > To support proxy, use Bio::Command.get_uri(url). 
OK and OK

> > def XML::valid_accession?(acc = nil)
> >   acc = @acc if not acc
> >   acc =~ /^(GSM|GSE|GPL)\d+$/
>
> If "GSM0123\nGSM4567" is invalid, the regular expression
> should be /\A(GSM|GSE|GPL)\d+\z/ .

good point.

> > def XML::parsexml(acc)
>
> Is there no way to get input XML data as String?

Sigh. Sure there is. Or from a file. An IO object would be cool.
Maybe the next version.

> > if XML::valid_accession? acc
> >   cache = Cache.instance.directory
> >   fn = cache+'/'+acc+'.xml'
>
> Please use File.join.

Sorry. OK.

Pj.

From davide.rambaldi at ifom-ieo-campus.it Thu Sep 25 07:35:54 2008
From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi)
Date: Thu, 25 Sep 2008 09:35:54 +0200
Subject: [BioRuby] test and bioruby shell questions
In-Reply-To: <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp>
Message-ID: <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it>

On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote:

> The demo above was designed to utilize the KEGG API, which is a
> SOAP based web service,
> so we need to change the default data source to obtain this entry.
> We can fix this by switching to use NCBI's efetch method instead.

I managed to write a fix for this... it is really horrible actually (but it works).
I have inserted my code in the nested if/else that retrieves the entry, so after the KEGG API attempt, the shell tries NCBI::REST.efetch

Oni:~/src/bioruby tucano$ git diff lib/bio/shell/plugin/entry.rb
diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/entry.rb
index 6d36fb5..0a45ecd 100644
--- a/lib/bio/shell/plugin/entry.rb
+++ b/lib/bio/shell/plugin/entry.rb
@@ -88,8 +88,16 @@ module Bio::Shell

       # KEGG API at http://www.genome.jp/kegg/soap/
       else
-        puts "Retrieving entry from KEGG API (#{arg})"
         entry = bget(arg)
+        if $?.exitstatus == 0 and str.length != 0
+          puts "Retrieving entry from KEGG API (#{arg})"
+        else
+          # efetch from NCBI
+          puts "Retrieving entry from NCBI (#{arg})"
+          require 'bio/io/ncbirest.rb'
+          fetch = Bio::NCBI::REST.efetch("AF237819", {"db"=>"nuccore", "rettype"=>"gb"})
+          entry = fetch.to_s
+        end
       end
     end

So the questions/comments:

1. I have added the require 'bio/io/ncbirest.rb' because in bio.rb ncbirest.rb is not loaded (only SOAP). Is this a bug or a feature?

2. The standard demo command is now able to retrieve the GenBank entry, but generates an error in the MIDI file generation:

bioruby> midifile("data/AF237819.mid", kuma.naseq)
Saving MIDI file (data/AF237819.mid) ...
Error: Failed to save (data/AF237819.mid) : No such file or directory - data/AF237819.mid

Any clue about this failure?

By the way: wow, a module to translate a sequence into music? I really want to test it too! I have made a piece of software that does something similar:

http://recipient.cc/playgene/

It is made with a Perl (?!) script to efetch the sequence from NCBI, and a Flash application for the interface and to load the music library... :-)

Best Regards

Davide Rambaldi, Bioinformatics PhD student.
----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Thu Sep 25 14:58:17 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 25 Sep 2008 23:58:17 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080924162924.GA19778@thebird.nl> References: <20080902065055.GA29634@thebird.nl> <20080902084712.AEDF81CBC46F@idnmail.gen-info.osaka-u.ac.jp> <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> <20080924162924.GA19778@thebird.nl> Message-ID: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Hi, On Wed, 24 Sep 2008 18:29:24 +0200 pjotr2008 at thebird.nl (Pjotr Prins) wrote: > Hi Naohisa, > > On Wed, Sep 24, 2008 at 10:38:19PM +0900, Naohisa GOTO wrote: > > Hi Pjotr, > > > > I've seen files in your lib/bio/db/microarray, and I suppose > > it's still under development and it will be changed frequently, > > and I think it's not a time to include them in main bioruby. > > So, my comments below are mainly for future improvements. > > What there is is 'stable'. Certainly the NCBI stuff is rather complete. The > biolib libraries could go in later. It is up to you, but I think it would be > nice to have mainstream microarray support before one of the other Bio* > libraries (and biolib support is there for all). 
We don't want to be beaten by > BioPerl, for one ;-). If nothing else I can make a BioRuby-with-Microarrays gem > available - but that may be confusing for others. I agree it is good to have microarray support, if it is useful. Could you please show short examples and use cases of the microarray support? > Another thing, what is the point of open source software if no one tests it. > How about regularly releasing a testing version of bioruby? We see some more > activity in BioRuby - which is a good thing. You can't expect things to be > ready from the word GO! I think new version should be released soon, but currently, there is no release management. > Meanwhile, I do appreciate your comments. It is forcing me to write better > code. Teaching an old fox new tricks ;-) > > > 1. about cache.rb > > > > The "safe = true" argument in 'set' and 'directory' seems > > bad idea. I think there is no need to give insecure options > > to users. > > I'll remove it if you wish. I think it is up to the implementor - if you have a > web service you better use the default safe mode. Otherwise, who cares. I, for > one, would like to use /tmp in some cases. I wish it is to be removed. Recently, temporary file vulnerability in software not directly related to server services have also been treated as security issue, e.g. f2c (fortran to C converter) http://www.debian.org/security/2005/dsa-661 So, it's good not to give a chance of insecure operation. > > In 'directory' method, > > > cache = Dir.mktmpdir(subdir) > > > > The Dir.mktmpdir method is a new feature added in Ruby 1.8.7, > > and not available in 1.8.6 and older versions. > > Because most users are still using Ruby 1.8.5 and 1.8.6, > > to avoid using Dir.mktmpdir is currently a choice. > > Alternatively, write a document that the feature can work > > only in Ruby 1.8.7 or later. > > Yes we can document that. Using microarray bindings a later Ruby is a > good idea anyway. OK. 
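A minimal sketch of the compatibility point above, assuming the cache only needs a private scratch directory. The helper name make_cache_dir is hypothetical and is not part of BioRuby or of the cache.rb under review. It uses Dir.mktmpdir where the running Ruby provides it and falls back to Dir.mkdir otherwise.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not BioRuby API): create a private temporary
# directory, preferring Dir.mktmpdir when it is available.
def make_cache_dir(prefix = 'bioruby-cache')
  if Dir.respond_to?(:mktmpdir)
    # Ruby 1.8.7 and later: the directory is created with mode 0700.
    Dir.mktmpdir(prefix)
  else
    # Older Ruby: build a name by hand. Dir.mkdir raises Errno::EEXIST
    # if the name is already taken, so an existing path is never reused.
    name = "#{prefix}-#{Process.pid}-#{rand(0x100000000).to_s(36)}"
    path = File.join(Dir.tmpdir, name)
    Dir.mkdir(path, 0700)
    path
  end
end

dir = make_cache_dir('GEO')
puts dir
FileUtils.remove_entry_secure(dir)  # available since Ruby 1.8.3
```

The fallback branch is only one possible choice; documenting a Ruby 1.8.7 requirement, as agreed above, avoids the need for it.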
Question: Does the microarray support work on Ruby 1.9? Most part of bioruby still do not support Ruby 1.9, though some code can run on Ruby 1.9. > > Note that current requirement of BioRuby is > > "Ruby 1.8.2 or later (Ruby 1.8.4 or later is recommended)". > > Also note that FileUtils.remove_entry_secure was introduced > > in Ruby 1.8.3. > > Well, the modules are optionally included. It shouldn't break if > people don't use the microarray stuff. This is true for the dependency > on external biolib too. OK. > > Finally, I'm wondering if the Cache class can still be > > a singleton or not in the future. Currently, only NCBI_GEO > > is using the cache, but if it were used from many classes > > with different data formats, files with different formats > > would be existed in the same cache directory, and file name > > conflicts might be happened. > > This implementation is such that we create a shared dir, with classes using > different subfolders - i.e. tmpdir/GEO/. This prevents name clashes between > modules. My current GEO cache is 30 Mb. If I were to download that every time > my research would be severely hampered. I think it is very useful and could > also be for running webservices of other modules. You don't want web servers > to retain everything in memory. In the current implementation, the singleton object stores @subdir, and it is the same as a global variable. For example, If a user want to get both GEO and ArrayExpress (hopefully supported in the future), and I wrote a code like this: Bio::Microarray::Cache.set('/home/who/.bioruby-cache') obj1 = Bio::Microarray::GEO::GSE.new('GSE1') obj2 = Bio::Microarray::ArrayExpress.new('Acc2') obj3 = Bio::Microarray::GEO::GSE.new('GSE3') obj4 = Bio::Microarray::ArrayExpress.new('Acc4') In this case, how to specify sub directory? Or, am I misunderstanding @subdir? BTW, FYI, there is memcached, on-memory cache for web server. http://www.danga.com/memcached/ > > 2. 
About file locations > > > > Below are recommended to be moved to bio/io/, > > because their main purpose is file or network I/O, > > and not data parsing. > > bio/db/microarray/cache.rb > > OK. > > > Bio::Microarray::GEO::XML in bio/db/microarray/ncbi_geo/geo.rb > > It does NCBI XML parsing - but that is not what you mean? I meant only XML.create, XML.fetch, and XML.parsexml methods. But, because they are short, I think again that no need to move them. For microarray data, or for large-scale data, because of efficiency, I can understand that close relationship between I/O and data format class is needed. However, from the viewpoint to treat various data from various databases, separating I/O and data parsing is better, maybe in the future. > > The class/module names are not needed to be changed. > > > > The files with external dependency to the "biolib" might > > also be suggested to be moved from bio/db to the other > > location, but no best location found. > > heh - anyone else a suggestiong? The biolib stuff does do microarray loading > and will do normalization and analysis soon. > > > 3. BIo::Microarray::NCBI_GEO > > > > In bio/db/microarray/ncbi_geo/geo.rb, > > > > > include REXML > > > > If the aim to include REXML module is only to skip the > > REXML:: prefix, I don't like to include it in library, > > because the constants and methods defined in REXML are > > mixed and they might cause bad side effects. > > (Note that unlike in a library, it is free to include > > anything in an application.) > > OK > > > > def XML::create(acc) > > > > In my impression, the method name "XML.create" might be > > reserved to be used by a method to create XML data structure > > from scratch or from some data. > > > To define a class method, I like 'def self.create(acc)' > > because it is easy to change class (module) name. > > It is a class factory. I'll have a think. I suggest Bio::Microarray::GEO::XML.new(acc). 
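Several review points raised in this thread (a class method written as 'def self.create', a regular expression anchored with \A and \z, and File.join for paths) can be sketched together. Everything below is illustrative: GeoSketch and its classes are hypothetical names, not the actual Bio::Microarray::GEO code, and the factory could equally be exposed under another name.

```ruby
# Hypothetical sketch; not the real Bio::Microarray::GEO implementation.
module GeoSketch
  class Record
    attr_reader :acc

    def initialize(acc)
      @acc = acc
    end
  end

  class GSM < Record; end
  class GSE < Record; end
  class GPL < Record; end

  class XML
    # \A and \z anchor the whole string. ^ and $ only anchor a line,
    # so "GSM0123\nGSM4567" would be accepted by the ^...$ form.
    ACCESSION = /\A(GSM|GSE|GPL)\d+\z/

    def self.valid_accession?(acc)
      !!(acc =~ ACCESSION)
    end

    # Class factory written with 'def self.', so the definition does not
    # have to change if the class is renamed.
    def self.create(acc)
      unless valid_accession?(acc)
        raise ArgumentError, "invalid accession: #{acc.inspect}"
      end
      # The first three letters select the record class (GSM, GSE or GPL).
      GeoSketch.const_get(acc[0, 3]).new(acc)
    end

    # File.join instead of cache + '/' + acc + '.xml'
    def self.cache_path(cache_dir, acc)
      File.join(cache_dir, "#{acc}.xml")
    end
  end
end

p GeoSketch::XML.valid_accession?("GSM0123\nGSM4567")  # => false
p GeoSketch::XML.create('GSE1').class                   # => GeoSketch::GSE
p GeoSketch::XML.cache_path('/tmp/GEO', 'GSE1')         # => "/tmp/GEO/GSE1.xml"
```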
> > > def XML::fetch(xmlfn, acc) > > > url = "http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=#{acc}&form=xml&view=brief&retmode=xml" > > > > URI escaping is needed, e.g. acc=#{URI.escape(acc)} > > > > > print "Fetching ",url,"\n" if $VERBOSE > > > r = Net::HTTP.get_response( URI.parse( url ) ) > > > > To support proxy, use Bio::Command.get_uri(url). > > OK and OK > > > > def XML::valid_accession?(acc = nil) > > > acc = @acc if not acc > > > acc =~ /^(GSM|GSE|GPL)\d+$/ > > > > If "GSM0123\nGSM4567" is invalid, the regular expression > > should be /\A(GSM|GSE|GPL)\d+\z/ . > > good point. > > > > def XML::parsexml(acc) > > > > Is there no way to get input XML data as String? > > Sigh. Sure there is. Of from a file. An IO object would be cool. > Maybe the next version. > > > > if XML::valid_accession? acc > > > cache = Cache.instance.directory > > > fn = cache+'/'+acc+'.xml' > > > > Please use File.join. > > Sorry. OK. > > Pj. > Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From davide.rambaldi at ifom-ieo-campus.it Thu Sep 25 16:18:39 2008 From: davide.rambaldi at ifom-ieo-campus.it (Davide Rambaldi) Date: Thu, 25 Sep 2008 18:18:39 +0200 Subject: [BioRuby] bioruby shell Message-ID: <32C805D6-081B-4616-BBE2-26645CCC8146@ifom-ieo-campus.it> Hello, I have posted as a reply in another thread a small modification to lib/bio/shell/plugin/entry.rb that resolve the demo problem (die after trying to download a genbank from KeggAPI): diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/ entry.rb index 6d36fb5..0a45ecd 100644 --- a/lib/bio/shell/plugin/entry.rb +++ b/lib/bio/shell/plugin/entry.rb @@ -88,8 +88,16 @@ module Bio::Shell # KEGG API at http://www.genome.jp/kegg/soap/ else - puts "Retrieving entry from KEGG API (#{arg})" entry = bget(arg) + if $?.exitstatus == 0 and str.length != 0 + puts "Retrieving entry from KEGG API (#{arg})" + else + # efetch from NCBI + puts "Retrieving entry from NCBI (#{arg})" + require 
'bio/io/ncbirest.rb' + fetch = Bio::NCBI::REST.efetch("AF237819", {"db"=>"nuccore", "rettype"=>"gb"}) + entry = fetch.to_s + end end end I have some other ideas for the shell: - adding a method to remove all saved objects - making an help (at least for demo, ls and rm commands) - adding an OptionParser In general I want to propose some other simple modification to this part of the bioruby library. I am losing my time? there is another person on this? or I can go on? Many thanks for feedback P.S: my simple BLAT application blatanalyzer is now accessible via svn at svn checkout svn://rubyforge.org/var/svn/blatanalyzer/trunk any feedback is really appreciated thanks again Davide Rambaldi, Bioinformatics PhD student. ----------------------------------------------------- Bioinformatic Group IFOM-IEO Campus Via Adamello 16, Milano I-20139 Italy [t] +39 02574303 066 [e] davide.rambaldi at ifom-ieo-campus.it [i] http://ciccarelli.group.ifom-ieo-campus.it/fcwiki/DavideRambaldi (homepage) [i] http://www.semm.it (PhD school) [i] http://www.btbs.unimib.it/ (Master) ----------------------------------------------------- From ngoto at gen-info.osaka-u.ac.jp Fri Sep 26 13:37:33 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Fri, 26 Sep 2008 22:37:33 +0900 Subject: [BioRuby] test and bioruby shell questions In-Reply-To: <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it> References: <3AF7B27B-E642-48FD-A612-ED88E9FF28BC@ifom-ieo-campus.it> <992DF6FE-14DF-45B9-B141-224FF473E82B@hgc.jp> <6B068F99-56FA-4E7E-AB60-887B83480F05@ifom-ieo-campus.it> Message-ID: <20080926133733.A99061CBC3F0@idnmail.gen-info.osaka-u.ac.jp> Hi, On Thu, 25 Sep 2008 09:35:54 +0200 Davide Rambaldi wrote: > On Aug 30, 2008, at 2:16 PM, Toshiaki Katayama wrote: > > > The demo above was designed to utilize the KEGG API, which is a > > SOAP based web service, > > so we need to change the default data source to obtain this entry. 
> > We can fix this by switching to use NCBI's efetch method instead. > > > > I manage to write a fix for this... is really horrible actually (but > it works) > I have inserted my code in the nested if/else that retrieve the > entry, so after the KEGG API try, the shell try NCBI::REST.efetch > > Oni:~/src/bioruby tucano$ git diff lib/bio/shell/plugin/entry.rb > diff --git a/lib/bio/shell/plugin/entry.rb b/lib/bio/shell/plugin/ > entry.rb > index 6d36fb5..0a45ecd 100644 > --- a/lib/bio/shell/plugin/entry.rb > +++ b/lib/bio/shell/plugin/entry.rb > @@ -88,8 +88,16 @@ module Bio::Shell > > # KEGG API at http://www.genome.jp/kegg/soap/ > else > - puts "Retrieving entry from KEGG API (#{arg})" > entry = bget(arg) > + if $?.exitstatus == 0 and str.length != 0 > + puts "Retrieving entry from KEGG API (#{arg})" > + else > + # efetch from NCBI > + puts "Retrieving entry from NCBI (#{arg})" > + require 'bio/io/ncbirest.rb' > + fetch = Bio::NCBI::REST.efetch("AF237819", > {"db"=>"nuccore", "rettype"=>"gb"}) > + entry = fetch.to_s > + end > end > end Thank you for a patch, but it has some problems: For KEGG API, $?.exitstatus has no mean, and no need to check $?. The "AF237819" should not be hardcoded because the method is not only for demo, but a bioruby-shell command to fetch entry specified by a user. Also note that "db" => "nuccore" would not always be good. (If result is empty, switching to another database and trying again would be the best way.) > So the questions/comments: > > 1. I have added the require 'bio/io/ncbirest.rb' beacuse in bio.rb > ncbirest.rb is not loaded (only SOAP). Is a bug or a feature? This is a bug, and it will soon be fixed. > 2. the standard demo command now is able to retrieve the genbank > entry, but generate an error in the MIDI file generation: > > bioruby> midifile("data/AF237819.mid", kuma.naseq) > Saving MIDI file (data/AF237819.mid) ... 
Error: Failed to save (data/ > AF237819.mid) : No such file or directory - data/AF237819.mid > > any clue for this FAil? The error may be caused because directory named "data" did not exist, and the program cannot save the file. To solve this, simply do "mkdir data". > by the way: wow a module to translate a sequence in music? I really > wont to test it also! I have made a software that do something similar: > > http://recipient.cc/playgene/ > > Is made with a perl (?!) script to efetch sequence from NCBI, and a > flash application for the interface and to load the music library... :-) Yes, you can enjoy music. Thanks, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr2008 at thebird.nl Mon Sep 29 12:34:11 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Mon, 29 Sep 2008 14:34:11 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> References: <20080902091958.GA31400@thebird.nl> <20080909113816.GA10051@thebird.nl> <20080910014822.36F5F1CBC3D3@idnmail.gen-info.osaka-u.ac.jp> <20080910074858.GA16861@thebird.nl> <20080918031700.074B21CBC498@idnmail.gen-info.osaka-u.ac.jp> <20080918063237.GA17631@thebird.nl> <20080923115852.GA6808@thebird.nl> <20080924133825.B61B41CBC3D8@idnmail.gen-info.osaka-u.ac.jp> <20080924162924.GA19778@thebird.nl> <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20080929123411.GA31668@thebird.nl> Hi Naohisa, On Thu, Sep 25, 2008 at 11:58:17PM +0900, Naohisa GOTO wrote: > I agree it is good to have microarray support, if it is useful. > Could you please show short examples and use cases of the > microarray support? You mean, like load file, read probe? There are unit tests for that in BioLib. I'll expand on the Tutorial once this goes into BioRuby. > Question: Does the microarray support work on Ruby 1.9? > Most part of bioruby still do not support Ruby 1.9, > though some code can run on Ruby 1.9. 
I will test my sources with 1.9. Should be no problem - no legacy stuff in there. > In the current implementation, the singleton object stores > @subdir, and it is the same as a global variable. > For example, If a user want to get both GEO and ArrayExpress > (hopefully supported in the future), and I wrote a code > like this: > > Bio::Microarray::Cache.set('/home/who/.bioruby-cache') > obj1 = Bio::Microarray::GEO::GSE.new('GSE1') > obj2 = Bio::Microarray::ArrayExpress.new('Acc2') > obj3 = Bio::Microarray::GEO::GSE.new('GSE3') > obj4 = Bio::Microarray::ArrayExpress.new('Acc4') > > In this case, how to specify sub directory? > Or, am I misunderstanding @subdir? Well, hey! You are making life a little difficult for me here. In an earlier mail you wrote: > Note that some classes use Tempfile class, a standard bundled > class with Ruby by default, and the Tempfile class depends > on enviroment variables (TMPDIR, TMP, etc.). So I introduced tmpdir - which I had to remove later. Also you wrote: > I think cache isn't suitable for standard, because its purpose > may differ from program (or class, module, etc.) to program. so I introduce a cache specific to the GEO module. This Cache definition is for GEO and used as such. There are no conflicts with other modules now - as there are none. Loading on demand is not a solution - as that would be unusable. The upside of a Singleton is that a cache gets defined once - and is not part of the normal interfaces. Modules can define their own subdirectories in the Cache. That would be OK. Lets not take this further until someone wants to build on this cache. It is not my itch to scratch. Like you wrote earlier, a cache implementation is non-trivial. Right. I wasn't intending to do that. The cache we have now is safe and sufficient for this module. I'll stick in a warning not to use the cache for other purposes. OK? > > It is a class factory. I'll have a think. > > I suggest Bio::Microarray::GEO::XML.new(acc). Not sure about that. 
The definition of 'new' is tied to initializing a class. Here we have a factory method, we need to distinguish. Code should really document itself. I think my 'create' is actually fine for a factory, but if anyone has another suggestion? These examples all use 'create': http://www.scribd.com/doc/396559/gof-patterns-in-ruby Pj. From ngoto at gen-info.osaka-u.ac.jp Mon Sep 29 20:26:39 2008 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto) Date: Tue, 30 Sep 2008 05:26:39 +0900 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080929123411.GA31668@thebird.nl> References: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> <20080929123411.GA31668@thebird.nl> Message-ID: <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp> Hi Pjotr, > Hi Naohisa, > > On Thu, Sep 25, 2008 at 11:58:17PM +0900, Naohisa GOTO wrote: > > I agree it is good to have microarray support, if it is useful. > > Could you please show short examples and use cases of the > > microarray support? > > You mean, like load file, read probe? There are unit tests for that > in BioLib. I'll expand on the Tutorial once this goes into BioRuby. OK. > > Question: Does the microarray support work on Ruby 1.9? > > Most part of bioruby still do not support Ruby 1.9, > > though some code can run on Ruby 1.9. > > I will test my sources with 1.9. Should be no problem - no legacy > stuff in there. Now, don't mind if it fails to run on Ruby 1.9. We will be gradually migrating to 1.9 after the relase of Ruby 1.9.1 in the future, not now. > > In the current implementation, the singleton object stores > > @subdir, and it is the same as a global variable. 
> > For example, If a user want to get both GEO and ArrayExpress > > (hopefully supported in the future), and I wrote a code > > like this: > > > > Bio::Microarray::Cache.set('/home/who/.bioruby-cache') > > obj1 = Bio::Microarray::GEO::GSE.new('GSE1') > > obj2 = Bio::Microarray::ArrayExpress.new('Acc2') > > obj3 = Bio::Microarray::GEO::GSE.new('GSE3') > > obj4 = Bio::Microarray::ArrayExpress.new('Acc4') > > > > In this case, how to specify sub directory? > > Or, am I misunderstanding @subdir? > > Well, hey! You are making life a little difficult for me here. In an > earlier mail you wrote: > > > Note that some classes use Tempfile class, a standard bundled > > class with Ruby by default, and the Tempfile class depends > > on enviroment variables (TMPDIR, TMP, etc.). > > So I introduced tmpdir - which I had to remove later. Also you wrote: > > > I think cache isn't suitable for standard, because its purpose > > may differ from program (or class, module, etc.) to program. > > so I introduce a cache specific to the GEO module. This Cache > definition is for GEO and used as such. There are no conflicts with > other modules now - as there are none. Loading on demand is not a > solution - as that would be unusable. The name "Bio::Microarray::Cache" sounds as if this were common to all microarray classes. To make clear the Cache is only for GEO, please move the class under Bio::Microarray::GEO, i.e. the class name is changed from Bio::Microarray::Cache to Bio::Microarray::GEO::Cache. In addition, please move the file to bio/db/microarray/ncbi_geo/cache.rb (no need to move under bio/io because it is specific to GEO and not intended to be used with other classes/modules). > The upside of a Singleton is that a cache gets defined once - and is > not part of the normal interfaces. Modules can define their own > subdirectories in the Cache. That would be OK. > > Lets not take this further until someone wants to build on this > cache. It is not my itch to scratch. 
Like you wrote earlier, a cache > implementation is non-trivial. Right. I wasn't intending to do that. > The cache we have now is safe and sufficient for this module. > > I'll stick in a warning not to use the cache for other purposes. OK? OK. In BioRuby, there are already many classes/modules/methods with warning documents "users should not use it directly", "internal use only", etc. > > > It is a class factory. I'll have a think. > > > > I suggest Bio::Microarray::GEO::XML.new(acc). > > Not sure about that. The definition of 'new' is tied to initializing a > class. Here we have a factory method, we need to distinguish. Code > should really document itself. I think my 'create' is actually fine > for a factory, but if anyone has another suggestion? These examples > all use 'create': > > http://www.scribd.com/doc/396559/gof-patterns-in-ruby "create" will be used, if no good suggestion given. Though, maybe bioscientists don't know much about design patterns. -- Naohisa Goto From pjotr2008 at thebird.nl Mon Sep 29 20:35:19 2008 From: pjotr2008 at thebird.nl (Pjotr Prins) Date: Mon, 29 Sep 2008 22:35:19 +0200 Subject: [BioRuby] RFC Caching (was BioRuby standards) In-Reply-To: <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp> References: <20080925145817.90B1B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> <20080929123411.GA31668@thebird.nl> <20080930052354.9B6C.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <20080929203519.GA5277@thebird.nl> On Tue, Sep 30, 2008 at 05:26:39AM +0900, Naohisa Goto wrote: > "create" will be used, if no good suggestion given. > Though, maybe bioscientists don't know much about design patterns. We oughta teach 'em ;-). But, yes. You are right. Pj. From donttrustben at gmail.com Tue Sep 30 01:55:35 2008 From: donttrustben at gmail.com (Ben Woodcroft) Date: Tue, 30 Sep 2008 11:55:35 +1000 Subject: [BioRuby] Bioruby Website Problems Message-ID: Hi, I am having problems using the bioruby.org site. 
My first problem was that the fetch function started giving me 404s:

>> pdb = Bio::Fetch.new.fetch('PDB','2A06')
OpenURI::HTTPError: 404 Not Found
    from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http'
    from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
    from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
    from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
    from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
    from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
    from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
    from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
    from (irb):13

Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself redirects to the Human Genome Center (Tokyo Uni) front page.

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From donttrustben at gmail.com Tue Sep 30 02:40:39 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 30 Sep 2008 12:40:39 +1000
Subject: [BioRuby] biofetch confusion
Message-ID:

Hi,

I was running a fetch from within ruby using the alternate server, and ran into a problem it took stupid me a little while to figure out. Thought I might post to help others.

>> pdb = Bio::Fetch.new('www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06')
NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.downcase from /usr/lib/ruby/1.8/open-uri.rb:551:in `find_proxy' from /usr/lib/ruby/1.8/open-uri.rb:147:in `open_loop' from /usr/lib/ruby/1.8/open-uri.rb:164:in `call' from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop' from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch' from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop' from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri' from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri' from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch' from (irb):5 Same problem happens when you use the br_biofetch.rb script directly. The problem was fixed by adding 'http://' to the front of the url: >> pdb = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb ','2A06') ... pdb printed here ... Should bioruby add the http:// somehow? Thanks, ben -- FYI: My email addresses at unimelb, uq and gmail all redirect to the same place. From ktym at hgc.jp Tue Sep 30 02:47:45 2008 From: ktym at hgc.jp (Toshiaki Katayama) Date: Tue, 30 Sep 2008 11:47:45 +0900 Subject: [BioRuby] biofetch confusion In-Reply-To: References: Message-ID: <8A9B37C3-64AA-4CDA-AB32-E7C4155AECE7@hgc.jp> Hi, > Should bioruby add the http:// somehow? I don't think so. Please add protocol prefix by yourself. Toshiaki On 2008/09/30, at 11:40, Ben Woodcroft wrote: > Hi, > > I was running a fetch from within ruby using the alternate server, and ran > into a problem it took stupid me a little while to figure out. Thought I > might post to help others. > >>> pdb = Bio::Fetch.new('www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb','2A06') > NoMethodError: You have a nil object when you didn't expect it! 
> The error occurred while evaluating nil.downcase > from /usr/lib/ruby/1.8/open-uri.rb:551:in `find_proxy' > from /usr/lib/ruby/1.8/open-uri.rb:147:in `open_loop' > from /usr/lib/ruby/1.8/open-uri.rb:164:in `call' > from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop' > from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch' > from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop' > from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri' > from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri' > from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch' > from (irb):5 > > > Same problem happens when you use the br_biofetch.rb script directly. > > > The problem was fixed by adding 'http://' to the front of the url: > >>> pdb = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch').fetch('pdb > ','2A06') > ... pdb printed here ... > > > Should bioruby add the http:// somehow? > > Thanks, > ben > > -- > FYI: My email addresses at unimelb, uq and gmail all redirect to the same > place. > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ktym at hgc.jp Tue Sep 30 02:40:56 2008 From: ktym at hgc.jp (Toshiaki Katayama) Date: Tue, 30 Sep 2008 11:40:56 +0900 Subject: [BioRuby] Bioruby Website Problems In-Reply-To: References: Message-ID: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp> Hi, Sorry for any inconveniences. I forgot to care about this, but it is due to our server replacement. The bioruby.org services (including BioFetch server) will be unavailable until Oct 2nd. Meanwhile, I can recommend you to use TogoWS service hosted at http://togows.dbcls.jp/entry/pdb/2A06 which we have developed these months utilizing BioRuby functionality. Regards, Toshiaki Katayama On 2008/09/30, at 10:55, Ben Woodcroft wrote: > Hi, > > I am having problems using the bioruby.org site. 
My first problem was that
> the fetch function started giving me 404s:
>
>>> pdb = Bio::Fetch.new.fetch('PDB','2A06')
> OpenURI::HTTPError: 404 Not Found
>         from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http'
>         from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
>         from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
>         from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
>         from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
>         from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
>         from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
>         from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
>         from (irb):13
>
> Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself
> redirects to the Human Genome Center (Tokyo Uni) front page.
>
> Thanks,
> ben
>
> --
> FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> place.

From donttrustben at gmail.com Tue Sep 30 03:47:46 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 30 Sep 2008 13:47:46 +1000
Subject: [BioRuby] Bioruby Website Problems
In-Reply-To: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp>
References: <23868174-D2AD-43CC-8048-35827D17BDA5@hgc.jp>
Message-ID:

Thanks for the quick reply. That togows site looks cool.

2008/9/30 Toshiaki Katayama

> Hi,
>
> Sorry for the inconvenience.
>
> I forgot to care about this, but it is due to our server replacement.
> The bioruby.org services (including the BioFetch server) will be unavailable
> until Oct 2nd.
>
> Meanwhile, I recommend using the TogoWS service hosted at
>
>   http://togows.dbcls.jp/entry/pdb/2A06
>
> which we have developed over the past few months utilizing BioRuby functionality.
>
> Regards,
> Toshiaki Katayama
>
> On 2008/09/30, at 10:55, Ben Woodcroft wrote:
>
> > Hi,
> >
> > I am having problems using the bioruby.org site. My first problem was that
> > the fetch function started giving me 404s:
> >
> >>> pdb = Bio::Fetch.new.fetch('PDB','2A06')
> > OpenURI::HTTPError: 404 Not Found
> >         from /usr/lib/ruby/1.8/open-uri.rb:277:in `open_http'
> >         from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
> >         from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
> >         from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
> >         from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
> >         from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
> >         from /home/ben/forays/bioruby_wwood/lib/bio/command.rb:223:in `read_uri'
> >         from /home/ben/forays/bioruby_wwood/lib/bio/io/fetch.rb:109:in `fetch'
> >         from (irb):13
> >
> > Went to bioruby.org/rdoc in firefox and that also fails. bioruby.org itself
> > redirects to the Human Genome Center (Tokyo Uni) front page.
> >
> > Thanks,
> > ben
> >
> > --
> > FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> > place.

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From donttrustben at gmail.com Tue Sep 30 04:21:12 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Tue, 30 Sep 2008 14:21:12 +1000
Subject: [BioRuby] Bio::SPTR bug and fix
Message-ID:

Hi,

So I was trying to parse a uniprot file, and I found that bioruby threw an
error when I asked it to return a DR key that didn't exist in the uniprot file
(in particular, GO annotations when none were defined).
I made a branch that fixes this by returning [] in that situation, and added
a test for it as well:
http://github.com/wwood/bioruby/tree/sptr_fix

If this code is good enough then can I request it be merged into the tree?

Thanks,
ben

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 30 09:05:44 2008
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 30 Sep 2008 18:05:44 +0900
Subject: [BioRuby] Bio::SPTR bug and fix
In-Reply-To:
References:
Message-ID: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp>

Thank you.

I modified your patch and committed it to my repository.

http://github.com/ngoto/bioruby/commit/6299d291b925442d828ff2a95c4526c45dc62208

It will soon be merged into the main bioruby git repo.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 30 Sep 2008 14:21:12 +1000
"Ben Woodcroft" wrote:

> Hi,
>
> So I was trying to parse a uniprot file, and I found that bioruby threw an
> error when I asked it to return a DR key that didn't exist in the uniprot file
> (in particular, GO annotations when none were defined).
>
> I made a branch that fixes this by returning [] in that situation, and added
> a test for it as well:
> http://github.com/wwood/bioruby/tree/sptr_fix
>
> If this code is good enough then can I request it be merged into the tree?
>
> Thanks,
> ben
>
> --
> FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> place.

From donttrustben at gmail.com Tue Sep 30 23:09:06 2008
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Wed, 1 Oct 2008 09:09:06 +1000
Subject: [BioRuby] Bio::SPTR bug and fix
In-Reply-To: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp>
References: <20080930090545.2D0B81CBC3AB@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Thanks.

You are the first person to call me Dr. Ben Woodcroft - while I don't mind
the sound of that I'm still a first year PhD student.

ben

2008/9/30 Naohisa GOTO

> Thank you.
>
> I modified your patch and committed it to my repository.
>
> http://github.com/ngoto/bioruby/commit/6299d291b925442d828ff2a95c4526c45dc62208
>
> It will soon be merged into the main bioruby git repo.
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
> On Tue, 30 Sep 2008 14:21:12 +1000
> "Ben Woodcroft" wrote:
>
> > Hi,
> >
> > So I was trying to parse a uniprot file, and I found that bioruby threw an
> > error when I asked it to return a DR key that didn't exist in the uniprot file
> > (in particular, GO annotations when none were defined).
> >
> > I made a branch that fixes this by returning [] in that situation, and added
> > a test for it as well:
> > http://github.com/wwood/bioruby/tree/sptr_fix
> >
> > If this code is good enough then can I request it be merged into the tree?
> >
> > Thanks,
> > ben
> >
> > --
> > FYI: My email addresses at unimelb, uq and gmail all redirect to the same
> > place.

--
FYI: My email addresses at unimelb, uq and gmail all redirect to the same place.
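
The fix discussed in the Bio::SPTR thread above (returning [] when a DR key is not present in the entry) can be illustrated with a small sketch. This is not the code from the sptr_fix branch or from Bio::SPTR itself; `parse_dr_lines`, `dr`, and the sample entry are hypothetical names and made-up data chosen only to show the idea of falling back to an empty array instead of raising an error.

```ruby
# Hypothetical, simplified stand-in for a UniProt DR-line lookup.
# It is NOT the Bio::SPTR code; names and parsing rules are assumptions.

# Collect "DR   DATABASE; field; field; ..." lines into a hash that maps
# each database name to an array of field arrays.
def parse_dr_lines(text)
  table = {}
  text.each_line do |line|
    next unless line.start_with?('DR ')
    fields = line.sub(/\ADR\s+/, '').strip.chomp('.').split(/;\s*/)
    database = fields.shift
    (table[database] ||= []) << fields
  end
  table
end

# Return the cross-references for one database, or [] when none exist.
def dr(table, key)
  table.fetch(key, [])
end

# Made-up two-line entry with a PDB cross-reference but no GO annotation.
entry = "ID   EXAMPLE_ENTRY\nDR   PDB; 2A06; X-ray; 2.10 A; A=1-100.\n"
table = parse_dr_lines(entry)

p dr(table, 'PDB')  # one cross-reference
p dr(table, 'GO')   # empty array rather than an error
```

Returning an empty array lets callers iterate over the result without first checking whether the key exists, which seems to be the behaviour the thread settles on.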