From yannick.wurm at unil.ch Tue Nov 3 09:11:52 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 15:11:52 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To:
References:
Message-ID: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>

Hi,

this is a more general ruby question, but since my application is
bioinformatics, I'm posting it here.

Just wanted to prepend a few characters in front of FASTA identifiers.

$ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
        real    0m20.379s
        user    0m0.741s
        sys     0m0.168s

While the perl equivalent is one heck of a lot faster!!!

$ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
        real    0m2.165s
        user    0m0.266s
        sys     0m0.146s

Is there any hope for ruby?

Thanks,
yannick

--------------------------------------------
yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
http://www.unil.ch/dee/page28685_fr.html

From yannick.wurm at unil.ch Tue Nov 3 17:49:12 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 23:49:12 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To:
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>

Hi Mike,

thanks for your response. I'm running:
ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
Starting to age, but on a production machine I'd rather stay with what
works than risk breaking things by upgrading them.

the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

My reasons for preferring ruby are the same as yours. But a 5 to 10x
speed difference is expensive (I'm calling the one-liner below about
10,000 times from a larger ruby script - YES, it's ugly, but refactoring
the script to avoid calling that type of oneliner would be a pain since
I use 10,000 different prefixes).

I have the feeling that it's ruby's startup-time especially. Running the
ruby one-liner on a fasta of 40,000 sequences takes 20 seconds; running
it on a fasta of only 10 lines still takes 13 seconds!!

I found some generic benchmarks indicating that ruby is generally only a
bit slower than perl
http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

So maybe I can keep using ruby - just avoiding one-liners!

Best,
yannick

On 3 Nov 2009, at 22:26, Michael Barton wrote:

> What version of Ruby are you using?
> Ruby is an expressive language rather than a "fast" language.
> I use Ruby because it's easier to read and maintain my programs, rather
> than because of how fast it is.
>
> If you are interested purely in speed you could write in C?
> What are the benchmarks for something like this?
>
> time sed 's/^>/>MyPrefix/' clustering/dirsForAssembly/singlets.fasta > abc
>
> Mike
>
> 2009/11/3 Yannick Wurm :
>> Hi,
>>
>> this is a more general ruby question, but since my application is
>> bioinformatics, I'm posting it here.
>>
>> Just wanted to prepend a few characters in front of FASTA
>> identifiers.
>>
>> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>>         real    0m20.379s
>>         user    0m0.741s
>>         sys     0m0.168s
>>
>> While the perl equivalent is one heck of a lot faster!!!
>>
>> $ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
>>         real    0m2.165s
>>         user    0m0.266s
>>         sys     0m0.146s
>>
>> Is there any hope for ruby?
>>
>> Thanks,
>> yannick
>>
>> --------------------------------------------
>> yannick . wurm @ unil . ch
>> Ant Genomics, Ecology & Evolution @ Lausanne
>> http://www.unil.ch/dee/page28685_fr.html
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby

From juanfc at uma.es Tue Nov 3 17:44:10 2009
From: juanfc at uma.es (Juan Falgueras)
Date: Tue, 3 Nov 2009 23:44:10 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <87CAA48B-151F-41C3-9DF5-23C4B43BDFD0@uma.es>

Hi,

have you tried it with Ruby 1.9?

On 03/11/2009, at 15:11, Yannick Wurm wrote:

> Hi,
>
> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>         real    0m20.379s
>         user    0m0.741s
>         sys     0m0.168s
>
> While the perl equivalent is one heck of a lot faster!!!
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
>         real    0m2.165s
>         user    0m0.266s
>         sys     0m0.146s
>
> Is there any hope for ruby?
>
> Thanks,
> yannick
>
> --------------------------------------------
> yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
> http://www.unil.ch/dee/page28685_fr.html

From trevor at corevx.com Tue Nov 3 18:18:50 2009
From: trevor at corevx.com (Trevor Wennblom)
Date: Tue, 3 Nov 2009 17:18:50 -0600
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID:

On Nov 3, 2009, at 4:49 PM, Yannick Wurm wrote:

> I found some generic benchmarks indicating that ruby is generally
> only a bit slower than perl
> http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=yarv&lang2=perl&box=1

From robert.citek at gmail.com Tue Nov 3 20:32:12 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Tue, 3 Nov 2009 20:32:12 -0500
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <4145b6790911031732m731d0b09o199041ab0feb610c@mail.gmail.com>

On Tue, Nov 3, 2009 at 9:11 AM, Yannick Wurm wrote:
> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>         real    0m20.379s
>         user    0m0.741s
>         sys     0m0.168s
>
> While the perl equivalent is one heck of a lot faster!!!
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
>         real    0m2.165s
>         user    0m0.266s
>         sys     0m0.146s
>
> Is there any hope for ruby?

I get a factor of about three on a 10,000,000 line FASTA file:

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | ruby -pe "gsub(/^>/, '>MyPrefix')" > /dev/null
real 42.99
user 43.39
sys 0.63

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | perl -pe 's/^>/>MyPrefix/g' > /dev/null
real 15.89
user 16.33
sys 0.26

This is with perl 5.8.8 and ruby 1.8.6 on a dual 1.6 GHz CPU with 512 MB
RAM.

Notice that your user and system times differ by less than a factor of
three. It's only the real time that is 10x, which suggests that ruby is
waiting on other processes, e.g. disk reads.

Regards,
- Robert

From pjotr.public14 at thebird.nl Wed Nov 4 05:22:45 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 4 Nov 2009 11:22:45 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <20091104102245.GA13264@thebird.nl>

On Tue, Nov 03, 2009 at 03:11:52PM +0100, Yannick Wurm wrote:
> Is there any hope for ruby?

I guess you mean this tongue in cheek. However, it is dangerous as it
may turn off users looking to start with Ruby or Perl. So let me state
I think there is plenty of hope for Ruby. You are talking about the
execution speed of 'simple' oneliners. For complex programming Ruby
outspeeds Perl, usually in practise. Particularly the speed of getting
things done, but also a cleaner way of programming helps create better
code. The end result will often be faster. And the third gain is in the
code maintenance cycle. I am talking from experience here. I have
written a lot of code in both languages (and Python too).

Perl6 is getting interesting. The syntax is much cleaned up, proper
OOP, and (what I like) strong functional programming support. But its
execution speed is not even close to Ruby's now. I have heard people
joke that Ruby is what Perl6 was meant to be.

Anyway you can see where the Perl folks are heading.

Pj.

P.S. What is there to stop you from using both languages?
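
For the FASTA-prefix task discussed above, one option that sidesteps interpreter startup altogether is to do the substitution inside the calling Ruby script instead of spawning a one-liner per prefix. The following is only a minimal sketch under that assumption; the method name `prefix_fasta_ids` is hypothetical and does not come from the thread or from BioRuby.

```ruby
require 'stringio'

# Hypothetical helper (not from the thread or BioRuby): copy FASTA data from
# `input` to `output`, inserting `prefix` right after the '>' of each header.
def prefix_fasta_ids(input, output, prefix)
  input.each_line do |line|
    if line.start_with?('>')
      # Header line: keep '>' and put the prefix in front of the identifier.
      output << ">#{prefix}#{line[1..-1]}"
    else
      # Sequence line: pass through unchanged.
      output << line
    end
  end
  output
end

# Small in-memory demonstration; real use would pass File objects instead.
fasta  = StringIO.new(">seq1\nACGT\n>seq2\nGGCC\n")
result = prefix_fasta_ids(fasta, StringIO.new, 'MyPrefix').string
puts result
```

Called in a loop over many prefixes, this pays the Ruby startup cost once rather than once per prefix; whether that accounts for the timings reported in this thread is not something the sketch can show.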

From mail at michaelbarton.me.uk Wed Nov 4 06:24:36 2009
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Wed, 4 Nov 2009 11:24:36 +0000
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID:

2009/11/3 Yannick Wurm :
> thanks for your response. I'm running:
> ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
> Starting to age, but on a production machine I'd rather stay with what works
> than risk breaking things by upgrading them.

I think Ruby 1.9 is now the official Ruby release, so you might want to
start trying out this version; for example, Rails 3.0 won't work with
Ruby 1.8.6 anymore. I've tried Ruby 1.9 a bit myself and the
requirements for compatibility are relatively small.

If you still prefer to use 1.8, you could try using REE
(http://www.rubyenterpriseedition.com/) which has a few patches to
improve performance over vanilla 1.8.

You could try using ruby_switcher which makes trying different ruby
versions a bit less painful - http://bit.ly/1kY1Qk

> the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

Could you just try calling out to sed then?

> I have the feeling that it's ruby's startup-time especially. Running the
> ruby one-liner on a fasta of 40,000 sequences takes 20 seconds; running it
> on a fasta of only 10 lines still takes 13 seconds!!

You might also want to try experimenting with gsub! instead of gsub, as
the former does destructive in-place substitution while the latter
creates an extra object with the substituted text. This extra object
creation might also slow performance.

Cheers

Mike

From diapriid at gmail.com Wed Nov 4 13:29:13 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 13:29:13 -0500
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
Message-ID: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com> Hi all, As far as I can tell there is yet no straightforward way to use Bio:Blast with the NCBI portal? I've seen this on the wiki: "Add remote BLAST search sites", and understand the basic concept, but don't have time at present to work on this. Is anyone actively working on this? (just FYI see http://github.com/kwicher/ruby-blast-at-ncbi). I ask in part because I'm struggling to get a basic remote blast working: seq = Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGTTTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA') blast_factory = 
Bio::Blast.new('blastn','nr-nt', '', 'genomenet')
foo = blast_factory.query(seq)

... freezes, when I ctrl-C

from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `call'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `sleep'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `exec_genomenet'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `__send__'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `query'
from (irb):25

any glaring problems with this? Is it just waiting for the results of
the remote query? I noticed that the genomenet blasts are much slower
than NCBI in general (I'm in the US).

thanks,
Matt

From diapriid at gmail.com Wed Nov 4 14:57:11 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 14:57:11 -0500
Subject: [BioRuby] (previous answered in part) timeout/long time
Message-ID: <19d6b9770911041157l1556ac89s4e8c62ad2e20460d@mail.gmail.com>

Aha- my queries *are* working, just taking a very long time to finish.
Can I limit to say top 10 results?

cheers,
Matt

From yannick.wurm at unil.ch Wed Nov 4 14:56:13 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Wed, 4 Nov 2009 20:56:13 +0100
Subject: [BioRuby] Ruby speed
Message-ID: <81E8B742-2508-40DF-8E81-07F1C8126839@unil.ch>

> Notice that your user and system times differ by less than a factor of
> three. It's only the real time that is 10x, which suggests that ruby is
> waiting on other processes, e.g. disk reads.

Great point Robert - I hadn't seen that. My guess is that the difference
is due to the fact that ruby is only installed in my networked (sfs)
home dir on the linux server, not on the local machine like perl is.

Gotta get the sysadmins to install ruby :)

cheers!
yannick

From email2ants at gmail.com Thu Nov 5 11:22:12 2009
From: email2ants at gmail.com (Anthony Underwood)
Date: Thu, 5 Nov 2009 16:22:12 +0000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <86C24368-84E1-4A43-ABBD-A26B998159B2@gmail.com>

Hi Matt

I have done a bit of work to get NCBI blast working within bioruby.

See this gist on github http://gist.github.com/227160

ncbi_blast.rb defines an exec_ncbi class for the Blast class in bioruby.
The script ncbi_blast_test.rb illustrates its usage but uses a few
functions defined in the blast_functions.rb file.

Essentially the following should work:

  require 'rubygems'
  require 'bio'
  require 'ncbi_blast'
  # use this if you are working from behind a proxy and enter ip and
  # port number as appropriate
  ENV['http_proxy'] = "http://proxy_server_ip:port_number"
  sequence = "ATGAATCCAAATCAGAAAATAATAA........"
  factory = Bio::Blast.remote('blastn', 'nr', '', 'ncbi')
  blast_report = factory.query(sequence)

blast_report will be a Bio::Blast::Report object which can be parsed as
described in the bioruby api.

The hit definitions are fairly uninformative, containing just the
accessions. This is why I then have to fetch the data from EMBL as
follows:

  accession = definition.split("|")[3]
  accession.sub!(/\..+$/, "") # remove version number
  server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
  embl_text = server.fetch('embl', accession)
  embl_object = Bio::EMBL.new(embl_text)
  puts embl_object.description

This is still a work in progress but it worked OK for me. Hope it is of
some use to you.

Anthony

On 4 Nov 2009, at 18:29, Matt wrote:

> Hi all,
>
> As far as I can tell there is yet no straightforward way to use
> Bio:Blast with the NCBI portal?
I've seen this on the wiki: "Add > remote BLAST search sites", and understand the basic concept, but > don't have time at present to work on this. Is anyone actively > working on this? (just FYI see > http://github.com/kwicher/ruby-blast-at-ncbi). > > I ask in part because I'm struggling to get a basic remote blast > working: > > seq = > Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGT! > TTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA') > > blast_factory = Bio::Blast.new('blastn','nr-nt', '', 'genomenet') > foo = blast_factory.query(seq) > > ... 
freezes, when I ctrl-C
>
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `call'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `sleep'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `exec_genomenet'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `__send__'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `query'
> from (irb):25
>
> any glaring problems with this? Is it just waiting for the results of
> the remote query? I noticed that the genomenet blasts are much
> slower than NCBI in general (I'm in the US).
>
> thanks,
> Matt

From kenglish at gmail.com Thu Nov 5 11:43:31 2009
From: kenglish at gmail.com (Kevin English)
Date: Thu, 5 Nov 2009 06:43:31 -1000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID:

Have you considered downloading the nr-nt databases and running local
queries? I played with the Blast Remote for a while but determined it
was too slow for our workload...

Kevin

From yannick.wurm at unil.ch Thu Nov 5 15:06:33 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 5 Nov 2009 21:06:33 +0100
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 1
In-Reply-To:
References:
Message-ID:

On 4 Nov 2009, at 18:00, bioruby-request at lists.open-bio.org wrote:
> I guess you mean this tongue in cheek. However, it is dangerous as it
> may turn off users looking to start with Ruby or Perl. So let me state
> I think there is plenty of hope for Ruby. You are talking execution
> speed of 'simple' oneliners. For complex programming Ruby outspeeds
> Perl, usually in practise. Particularly the speed of getting things
> done, but also a cleaner way of programming helps create better code.
> The end result will often be faster. And the third gain is in the code
> maintenance cycle. I am talking from experience here. I have written
> a lot of code in both languages (and Python too).

Those are excellent points, Pjotr.

> Perl6 is getting interesting. The syntax is much cleaned up, proper
> OOP, and (what I like) strong functional programming support. But its
> execution speed is not even close to Ruby's now. I have heard people
> joke that Ruby is what Perl6 was meant to be.
>
> Anyway you can see where the Perl folks are heading.

Yes, Damian Conway of Perl Best Practices gave us a small workshop
recently, and I couldn't help but think that Perl6 was an attempt to
rubify perl :)

> P.S. What is there to stop you from using both languages?

Nothing official. But I already find it difficult to keep the R, bash
and ruby parts of my brain optimized without mixing in perl and
others :)

Cheers,
yannick

From rob.syme at gmail.com Thu Nov 5 21:55:45 2009
From: rob.syme at gmail.com (Rob Syme)
Date: Fri, 6 Nov 2009 10:55:45 +0800
Subject: [BioRuby] Parsing large blastout.xml files
Message-ID:

I'm trying to extract information from a large blast xml file. To parse
the xml file, ruby reads the whole file into memory before looking at
each entry. For large files (2.5GBish) - the memory requirements become
severe.

My first approach was to split each query up into its own xml instance,
so that

[XML listing not preserved in the archive]

Would end up looking more like:

[XML listing not preserved in the archive]

Which bioruby has trouble parsing, so the instances each had to be given
their own file:

$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'

Now each file can be parsed individually. I feel like there has to be an
easier way. Is there a way to parse large xml files without huge memory
overheads, or is that just par for the course?
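
The csplit step above can also be approximated in Ruby without temporary files, by walking the concatenated output line by line and starting a new chunk at each XML declaration. This is only a rough sketch under the assumption that every document begins on a line containing `<?xml`; the helper name `each_xml_document` is hypothetical and is not part of BioRuby.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): yield each XML document found in
# a stream of concatenated documents. A new chunk starts at every line that
# contains an XML declaration, so only one document is buffered at a time.
def each_xml_document(io)
  chunk = nil
  io.each_line do |line|
    if line.include?('<?xml')
      yield chunk.join if chunk   # hand over the previous document
      chunk = []
    end
    chunk << line if chunk        # ignore anything before the first declaration
  end
  yield chunk.join if chunk       # hand over the last document
end

# In-memory demonstration; real use would pass File.open(path) instead.
data = StringIO.new("<?xml version=\"1.0\"?>\n<a/>\n<?xml version=\"1.0\"?>\n<b/>\n")
docs = []
each_xml_document(data) { |doc| docs << doc }
puts docs.size
```

Each yielded string could then be passed to whatever XML or BLAST report parser is in use. Memory use is bounded by the largest single document rather than by the whole file, which helps only if the output has already been segmented per query as described above.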

From rozziite at gmail.com Thu Nov 5 22:11:32 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Thu, 5 Nov 2009 22:11:32 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To:
References:
Message-ID: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>

Another option is to use ruby-libxml reader.
http://libxml.rubyforge.org/rdoc/index.html It reads the data
sequentially, thus there is no memory overhead of first reading it all
into memory. However, then you would have to parse it from scratch.

On that note, maybe it is worth implementing Bio::Blast::Report.libxml
or something like that, the same way there is Bio::Blast::Report.rexml
and Bio::Blast::Report.xmlparser. The dependency on ruby-libxml in the
BioRuby library was introduced in the PhyloXML parser.

Diana

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own xml
> instance, so that
>
> [XML listing not preserved in the archive]
>
> Would end up looking more like:
>
> [XML listing not preserved in the archive]
>
> Which bioruby has trouble parsing, so the instances each had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?

From adamnkraut at gmail.com Thu Nov 5 22:17:02 2009
From: adamnkraut at gmail.com (Adam)
Date: Thu, 5 Nov 2009 22:17:02 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To:
References:
Message-ID: <134ede0b0911051917nf0877e0y8df95c3147a24d07@mail.gmail.com>

You might want to try a SAX Parser instead. REXML from the standard
library has a streaming API. LibXML is a lot faster and it's available
as a gem.

http://libxml.rubyforge.org/

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own xml
> instance, so that
>
> [XML listing not preserved in the archive]
>
> Would end up looking more like:
>
> [XML listing not preserved in the archive]
>
> Which bioruby has trouble parsing, so the instances each had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?

From pjotr.public14 at thebird.nl Fri Nov 6 03:58:15 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 6 Nov 2009 09:58:15 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
References: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
Message-ID: <20091106085815.GA12244@thebird.nl>

Diana is right. We need to revamp the implementation for big results.
Not only that, the current implementation has method names that do not
match the BLAST names.

I need something like this pretty soon and was thinking of writing it.

Pj.

On Thu, Nov 05, 2009 at 10:11:32PM -0500, Diana Jaunzeikare wrote:
> Another option is to use ruby-libxml reader.
> http://libxml.rubyforge.org/rdoc/index.html It reads the data
> sequentially, thus there is no memory overhead of first reading it all
> into memory. However, then you would have to parse it from scratch.
>
> On that note, maybe it is worth implementing Bio::Blast::Report.libxml
> or something like that, the same way there is Bio::Blast::Report.rexml
> and Bio::Blast::Report.xmlparser. The dependency on ruby-libxml in the
> BioRuby library was introduced in the PhyloXML parser.
>
> Diana
>
> On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote:
> > I'm trying to extract information from a large blast xml file. To parse the
> > xml file, ruby reads the whole file into memory before looking at each
> > entry. For large files (2.5GBish) - the memory requirements become severe.
> >
> > My first approach was to split each query up into its own xml
> > instance, so that
> >
> > [XML listing not preserved in the archive]
> >
> > Would end up looking more like:
> >
> > [XML listing not preserved in the archive]
> >
> > Which bioruby has trouble parsing, so the instances each had to be given
> > their own file:
> >
> > $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
> >
> > Now each file can be parsed individually. I feel like there has to be an
> > easier way. Is there a way to parse large xml files without huge memory
> > overheads, or is that just par for the course?

From pjotr.public14 at thebird.nl Sat Nov 7 02:42:44 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 7 Nov 2009 08:42:44 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To:
References:
Message-ID: <20091107074244.GA22748@thebird.nl>

I did the same a while back using xmltwig:

http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml

On Fri, Nov 06, 2009 at 10:55:45AM +0800, Rob Syme wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own xml
> instance, so that
>
> [XML listing not preserved in the archive]
>
> Would end up looking more like:
>
> [XML listing not preserved in the archive]
>
> Which bioruby has trouble parsing, so the instances each had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?

From djaunzei at smith.edu Sat Nov 7 22:50:26 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Sat, 7 Nov 2009 22:50:26 -0500
Subject: [BioRuby] BioRuby Phyloxml update
Message-ID: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>

Hi all,

So finally I have updated Bio::Tree and Bio::Node classes to improve
the phyloxml writer speed.

* Added Bio::Node::parent and Bio::Node::children (array of nodes) in
  order to avoid calling Tree::parent(node) or Tree::children(node),
  because those methods call breadth-first search on the underlying
  graph, which makes the PhyloXML writer and parser incredibly slow. In
  contrast, Bio::Node::parent and Bio::Node::children keep references
  to the respective nodes.
* Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
  Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
  track of Node::parent and Node::children nodes correctly. Have I
  forgotten anything?
* Now the PhyloXML writer takes less than 1 second instead of
  ~20 minutes to write the ncbi_taxonomy_mollusca.xml file (1.5MB).
* Writing the tree of life taxonomy file (~46MB) takes 10 seconds
  (on 2.4GHz, 2.9GB RAM, running Ubuntu).

The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class

I wrote unit tests for my changes and made sure my changes don't break
anything else. However, does anybody have code lying around that uses
the Tree::parent and Tree::children methods so that I can test it more
thoroughly?

Cheers,
Diana

From ngoto at gen-info.osaka-u.ac.jp Sun Nov 8 07:50:56 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 08 Nov 2009 21:50:56 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
Message-ID: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Diana,

I'm sorry that the changes cannot be accepted, because the modification
of existing Bio::Tree methods breaks things. Bio::Tree does not want to
have children/parent information in nodes. One of the reasons is that
it is difficult to keep consistency when copying a tree. Nodes can be
shared between two or more trees when copying a tree by using the "dup"
or "clone" method.

Normally, tests for existing classes should not be modified except when
changing the specification or fixing a bug in the test, because they
guarantee the specification of the class. Adding new tests is OK.

If you really want nodes to have parent/children information in each
node, please do so only in the PhyloXML classes (though I'm negative).
In this case, the problem is that reading phyloxml data and writing it
back again seems good, but it seems there is currently no way to convert
Bio::Tree to PhyloXML. Now, it seems hard to convert Newick data to
PhyloXML.

Now, to prepare to include your PhyloXML code in BioRuby, I'm working on
my branch. Some API changes will be made.
http://github.com/ngoto/bioruby/tree/incoming

Note that in your test code, the argument order of assert_equal is wrong.
I've already fixed it in my branch.
http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94

> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly. Have I
> forgotten anything?

Changing root with tree.root=().

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From jan.aerts at gmail.com Mon Nov 16 05:11:24 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Mon, 16 Nov 2009 10:11:24 +0000
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
 <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>

All,

I think we should make a good effort to merge Diana's code into the
bioruby codebase. Even though I'm not completely familiar with
bioruby's phylo implementation, an effort like hers should be welcomed
with open arms.

If her code speeds things up so immensely, why don't we start a new
branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
With a major new release, things are allowed to break free from the
legacy code.

We definitely don't want Diana's efforts to be in vain.

jan.

From georgkam at gmail.com Tue Nov 17 00:40:31 2009
From: georgkam at gmail.com (George Githinji)
Date: Tue, 17 Nov 2009 08:40:31 +0300
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 6
In-Reply-To:
References:
Message-ID: <55915f820911162140w592077f4o448d63e11b4300be@mail.gmail.com>

If Ruby itself is known to be slow compared to other interpreters, and
Diana's code speeds things up, then as a BioRuby user I would plead with
the developers to adopt her code, with its speed optimizations, in the
next release. The next release can only be better if the current code
base is overhauled and reviewed based on new developments like Diana's.
If Newick can be converted to a format which can then be converted to
PhyloXML, then conversion to Newick is not a problem. Otherwise I would
question the use of the Newick format if it cannot be inter-converted
with other file formats.
--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From djaunzei at smith.edu Tue Nov 17 09:52:59 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Tue, 17 Nov 2009 09:52:59 -0500
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
 <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
 <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
Message-ID: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>

Thanks for the discussion. I see Naohisa's point that it is difficult to
keep consistency when copying a tree.

Right now the PhyloXML class inherits from the Bio::Tree class. Instead, I
could write a new general Bio::FamilyTree class (per Pjotr's
suggestion), which would be strictly a tree (I believe that Bio::Tree
allows for a node to have 2 parents) and would have parent/child
information. Thus it would not need the underlying general graph
implementation, which would make it simpler than Bio::Tree. Then
PhyloXML::Tree would inherit from Bio::FamilyTree.
This way the PhyloXML writer would probably be even faster, because it
would not need to update the Bio::Pathway structure (which is under
Bio::Tree) every time a node or edge is added.
Additionally, I think BioRuby would benefit from a general
Bio::FamilyTree class. I recently heard a talk by a researcher who did
a phylogenetic analysis of musical rhythms.

Also, I will write a method to convert from Newick to PhyloXML.

What do you think?

Cheers,
Diana

From ngoto at gen-info.osaka-u.ac.jp Tue Nov 17 11:27:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 18 Nov 2009 01:27:46 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
 <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
 <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
 <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
Message-ID: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I've just committed a speed-up of Bio::Tree#children in my repository.
It keeps compatibility. As a trade-off for the speed-up, memory
consumption is a little bit larger than with the previous code.
http://github.com/ngoto/bioruby

For the benchmark of reading and writing big PhyloXML data,
based on Diana's test_phyloxml_big.rb, a new sample script is added as
sample/test_phyloxml_big.rb.

Running the new sample/test_phyloxml_big.rb on a machine
(Pentium D 3.40GHz, memory 4GB, running Debian GNU/Linux)

with http://github.com/ngoto/bioruby:
47.52user 0.93system 0:50.09elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+167550minor)pagefaults 0swaps

with http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
43.55user 1.00system 0:46.59elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+165151minor)pagefaults 0swaps

Although my new code is still ~10% slower than Diana's new code,
I think it can be acceptable because my code keeps compatibility.

I wrote Bio::Tree because I want to manipulate trees flexibly,
e.g. merging and splitting trees, changing the root of trees.
For that purpose, I chose not to have parent/children in a node.

I also think the current Bio::Tree is not the best. One of the weak
points is that it is relatively heavy. The flexibility may not be
needed for parsers that only represent a fixed data structure.
A new class seems attractive for usages that cannot be covered
by the current Bio::Tree implementation.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From tomoakin at kenroku.kanazawa-u.ac.jp Tue Nov 17 19:24:34 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 18 Nov 2009 09:24:34 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
 <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
 <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
 <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
 <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi,

One point seems to be that a tree can be unrooted or rooted.
Perhaps Goto-san's Bio::Tree represents an unrooted tree (not
distinguishing parents and children), while Diana's class is for
rooted trees (with a distinction between parents and children).
If this is the point, Bio::RootedTree is a better name than
Bio::FamilyTree.

In general, a rooted tree should be easily converted to an unrooted
tree, while conversion of an unrooted tree to a rooted tree requires
specification of the root.
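
[Editorial sketch] The point about rooting can be illustrated with a
small plain-Ruby example. It is not BioRuby code: `root_tree` and
`RootedView` are hypothetical names, and the tree is assumed to be given
as a list of undirected edges. The same unrooted edge list yields
different parent/children tables depending on which node is chosen as
the root, and one breadth-first pass is enough to build tables that can
afterwards be read without searching the graph again.

```ruby
# Sketch only: RootedView and root_tree are hypothetical, not BioRuby API.
RootedView = Struct.new(:root, :parent, :children)

# edges: array of [a, b] pairs describing an undirected, connected, acyclic graph
# root:  the node the caller chooses as the root
def root_tree(edges, root)
  # Build an undirected adjacency list.
  adjacency = Hash.new { |hash, key| hash[key] = [] }
  edges.each do |a, b|
    adjacency[a] << b
    adjacency[b] << a
  end

  parent   = { root => nil }
  children = Hash.new { |hash, key| hash[key] = [] }
  queue    = [root]

  # One breadth-first pass from the chosen root orients every edge.
  until queue.empty?
    node = queue.shift
    adjacency[node].each do |neighbor|
      next if parent.key?(neighbor) # already visited; in a tree this is node's parent
      parent[neighbor] = node
      children[node] << neighbor
      queue << neighbor
    end
  end

  RootedView.new(root, parent, children)
end

edges = [%w[A X], %w[B X], %w[X Y], %w[C Y]]
by_x = root_tree(edges, "X")
by_y = root_tree(edges, "Y")

puts by_x.children["X"].inspect # ["A", "B", "Y"]
puts by_y.parent["X"].inspect   # "Y"
```

Going the other way (rooted to unrooted) needs no extra input: the
parent and children tables are simply ignored and only the edge list is
kept.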
For a text representation like NEWICK there is a root anyway, while
the tree can be interpreted either as rooted or unrooted.

It could be good to have distinct interfaces for rooted and unrooted
trees, to make users aware of the conceptual difference.
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp Wed Nov 18 19:33:32 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 19 Nov 2009 09:33:32 +0900
Subject: [BioRuby] Blast to Phylogeny
In-Reply-To: <4B045622.8040204@broadinstitute.org>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
 <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
 <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
 <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
 <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
 <4B045622.8040204@broadinstitute.org>
Message-ID:

Hi,

In general, to construct a phylogenetic tree from molecular sequence
data, you will collect the homologous sequences, perform multiple
alignment, identify the region that will be used for the
reconstruction, and then pass the data to an appropriate program to
reconstruct the phylogeny.

If I had a BLAST output, I would parse that file with Bio::FlatFile
and extract the identifiers of the hit sequences, use the identifiers
to collect the individual sequences, and submit the sequences to mafft
for multiple alignment. I would then convert the alignment to nexus
format and check it manually with MacClade, and finally parse the
edited nexus file to write the multiple alignment in a form readable by
the phylogenetic analysis program.

There are many options you can take at each step.
So, there are multiple ways, but not a single simple way. :(

BioRuby has support for multiple alignment programs like mafft,
muscle, and clustalw.
For phylogenetic reconstruction, there is some support for phylip and
paml (I haven't tried these features from the BioRuby library, though).

There are a number of programs for phylogenetic analysis other than
phylip and paml. A list compiled by J. Felsenstein is available at
http://evolution.genetics.washington.edu/phylip/software.html
An alignment in a format similar to that of phylip will be accepted by
most programs.
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2009/11/19, at 5:16, Sharvari Gujja wrote:

> Hi,
>
> I am trying to construct a phylogenetic tree from Blast
> output...Could you please let me know if there is a way to do
> this..I have also been looking at Bio::Tree documentation but it is
> not clear if it accepts Blast file as input.
>
> Appreciate any help.
>
> Thanks
> Sharvari

From robert.citek at gmail.com Thu Nov 19 15:06:22 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Thu, 19 Nov 2009 15:06:22 -0500
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <4145b6790911191206r53c86818m280e3a149f9293ec@mail.gmail.com>

Hello all,

I would like to create a custom BLAST scoring matrix that I can use
with NCBI's blastall. For example, let's say I want to create a
modified BLOSUM62 matrix called BLOSUM62ar, where the A:R score is now
2 instead of -1.

Some questions that I have:

1) is this possible?
2) if it is, where can I find documentation which describes how to do this?
3) is the blast output different from a regular blast?
4) if it is different, does bio-ruby have blast parsers that can parse
the output?

Thanks in advance for any pointers and suggestions.

Regards,
- Robert

From georgkam at gmail.com Sat Nov 21 03:58:53 2009
From: georgkam at gmail.com (George Githinji)
Date: Sat, 21 Nov 2009 11:58:53 +0300
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>

Hi Martin,

Thanks for bringing the topic on list. Some time ago I was also very
interested in custom matrices for NCBI blast.
Making custom matrices is possible. Check this out:
BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

However, making your matrices work with NCBI blast is slightly difficult,
as you need to recompile the BLAST program and incorporate your
modifications. I found this not so straightforward, for lack of good
documentation.

I wonder whether there is someone who has implemented the BLAST algorithm
in Ruby. (The argument is usually that the C implementation is very
optimized and good, so why would one want to implement it in Ruby?) I
would not buy that argument for learning purposes, though. The closest I
came to a BLAST algorithm is an implementation of it in Perl, in the book
Genomic Perl by Rex A. Dwyer. He also outlines how to create your own
matrices, with code listings in Perl.

Please ping me back if you get more resources. :)

George

--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From robert.citek at gmail.com Sun Nov 22 08:55:58 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Sun, 22 Nov 2009 08:55:58 -0500
Subject: [BioRuby] custom blast scoring matrix
In-Reply-To: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
References: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
Message-ID: <4145b6790911220555q410187fak9f8b1b66e4a0ddf2@mail.gmail.com>

On Sat, Nov 21, 2009 at 3:58 AM, George Githinji wrote:
> Thanks for bringing the topic on list. Sometimes back i was also very
> interested in custom matrices for NCBI blast.
> Making custom Matrices is possible.
> check this out
> BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

Thanks for the citation. I'll have a look into that.

> However making your matrices work with NCBI blast is slightly difficult as
> you need to recompile the BLAST program and incoporate your modifications. I
> found this a little bit not so straighforward. Lack of good documentation.

That's unfortunate. I've tried compiling NCBI blast a few times in
the past and don't ever recall having success with it, running into
the same issues you describe. But it's been a while and maybe the
process has become easier. I'll give it a whirl.

> I wonder whether there is someone who has implemented the BLAST algorithm in
> Ruby. (The argument is usually that the C implementation is very optimized
> and good, so why would one want to implement it in ruby?) though i would not
> buy that argument for learning purposes. The closest i came to a BLAST
> algorithm is an implementation of it in Perl, in the book Genomic Perl by
> Rex A. Dwyer, He also outlines how to create your own matrices with code
> listings in perl.

Thanks. I'll have a look at that as well.

> Please ping me back if you get more resources. :)

Will do.

Regards,
- Robert

From pjotr.public14 at thebird.nl Thu Nov 26 08:08:30 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:08:30 +0100
Subject: [BioRuby] Ruby EMBOSS mapping (using Biolib)
Message-ID: <20091126130830.GA19003@thebird.nl>

Hi all,

The last year I have been working on C library mappings to Ruby. A
comparison of Bioruby against Biolib/EMBOSS six frame translation of a
C.elegans dataset shows the Ruby with EMBOSS version is about 30x
faster. On my (outdated) machine:

Bioruby version:

  22929 records 137574 times translated!
  real    9m30.952s
  user    8m42.877s
  sys     0m32.878s

Biolib version:

  22929 records 137574 times translated!
  real    0m20.306s
  user    0m15.997s
  sys     0m1.344s

This is including IO - which is handled by Ruby.
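
[Editorial sketch] For readers unfamiliar with the operation being
benchmarked, six-frame translation can be sketched in plain Ruby as
follows. This is neither the BioRuby nor the Biolib/EMBOSS
implementation; all names are illustrative. It assumes the standard
genetic code and treats frames -1, -2, -3 as frames 1, 2, 3 of the
reverse complement, which is only one of the conventions in use (tools
differ in how they number negative frames).

```ruby
# Sketch only: plain Ruby, no BioRuby or BioLib; all names are illustrative.
BASES = %w[T C A G].freeze

# Standard genetic code, one letter per codon, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (last base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {}
BASES.product(BASES, BASES).each_with_index do |codon, index|
  CODON_TABLE[codon.join] = AMINO_ACIDS[index]
end

def reverse_complement(dna)
  dna.upcase.reverse.tr("ACGT", "TGCA")
end

# frame is 1, 2, 3 (forward strand) or -1, -2, -3 (reverse complement).
def translate(dna, frame)
  strand = frame > 0 ? dna.upcase : reverse_complement(dna)
  coding = strand[(frame.abs - 1)..-1] || ""
  # Trailing bases that do not fill a codon are dropped; unknown codons become "X".
  coding.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon, "X") }.join
end

def six_frames(dna)
  [-3, -2, -1, 1, 2, 3].map { |frame| [frame, translate(dna, frame)] }.to_h
end

frames = six_frames("ATGGCCTAA")
puts frames[1]  # MA*
puts frames[-1] # LGH
```

Each record therefore produces six protein strings, which matches the
22929 x 6 = 137574 translations reported in the timings.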
The Bioruby code reads:

  nt = FastaReader.new(fn)
  nt.each { | rec |
    seq = Bio::Sequence::NA.new(rec.seq)
    [-3,-2,-1,1,2,3].each do | frame |
      print "> ",rec.id," ",frame.to_s,"\n"
      print seq.translate(frame),"\n"
    end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

The Biolib code reads:

  nt = FastaReader.new(fn)
  trnTable = Biolib::Emboss.ajTrnNewI(1);
  nt.each { | rec |
    ajpseq = Biolib::Emboss.ajSeqNewNameC(rec.seq,"Test sequence")
    [-3,-2,-1,1,2,3].each do | frame |
      ajpseqt = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
      aa = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
      print "> ",rec.id," ",frame.to_s,"\n"
      print aa,"\n"
    end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

A write up of the mapping effort is at:

http://biolib.open-bio.org/wiki/Mapping_EMBOSS

From pjotr.public14 at thebird.nl Thu Nov 26 08:44:27 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:44:27 +0100
Subject: [BioRuby] Announcing BigBio project for Ruby
Message-ID: <20091126134427.GA20660@thebird.nl>

BigBio = BIG DATA computing (for Ruby)

BigBio is an initiative to create high performance libraries for big data computing in biology - initially for the Ruby language.

The Ruby version of BioBig uses BioRuby, when sensible, but provides an interface with a different design. Also, unlike BioRuby which aims to be pure Ruby, it uses BioLib C/C++ functions for increased performance and reduced memory consumption.

The first module is an (indexed) FastaReader which does not load the full FASTA file in memory.

http://github.com/pjotrp/bigbio

Pj.

From jan.aerts at gmail.com Thu Nov 26 08:44:58 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:44:58 +0000
Subject: [BioRuby] VCF
Message-ID: <4c7507a70911260544j4ba5f089y38c76d4f48131258@mail.gmail.com>

Is anyone working on a VCF (Variant Call Format) parser in bioruby?
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 From jan.aerts at gmail.com Thu Nov 26 08:46:52 2009 From: jan.aerts at gmail.com (Jan Aerts) Date: Thu, 26 Nov 2009 13:46:52 +0000 Subject: [BioRuby] Announcing BigBio project for Ruby In-Reply-To: <20091126134427.GA20660@thebird.nl> References: <20091126134427.GA20660@thebird.nl> Message-ID: <4c7507a70911260546w45839e7fra4a2565a66bc47ff@mail.gmail.com> Interesting... Planning to incorporate SAM/BAM alignment formats for nextgen sequences? jan. 2009/11/26 Pjotr Prins : > BigBio = BIG DATA computing (for Ruby) > > BigBio is an initiative to a create high performance libraries for big data > computing in biology - initially for the Ruby language. > > The Ruby version of BioBig uses BioRuby, when sensible, but provides an > interface with a different design. Also, unlike BioRuby which aims to be pure > Ruby, it uses BioLib C/C++ functions for increased performance and reduced > memory consumption. > > The first module is an (indexed) FastaReader which does not load the > full FASTA file in memory. > > http://github.com/pjotrp/bigbio > > Pj. > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From jan.aerts at gmail.com Thu Nov 26 08:52:16 2009 From: jan.aerts at gmail.com (Jan Aerts) Date: Thu, 26 Nov 2009 13:52:16 +0000 Subject: [BioRuby] Bio::DB::Sam Message-ID: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com> And another parser that probably should be added to bioruby: something to interact with SAM/BAM files (which contain mapping positions for short reads). More info at samtools.sourceforge.net Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should go for something similar? http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm Is anyone already working on this? jan. 
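As a starting point for the discussion above, here is a minimal pure-Ruby sketch of reading one line of the text SAM format (not binary BAM). It only relies on the eleven mandatory tab-separated columns and the 0x4 / 0x10 flag bits from the SAM specification. The names `SamRecord` and `parse_sam_line` are made up for illustration and are not part of BioRuby, Biolib or samtools.

```ruby
# Hypothetical record type: one field per mandatory SAM column, plus
# a Hash for the optional TAG:TYPE:VALUE fields.
SamRecord = Struct.new(:qname, :flag, :rname, :pos, :mapq, :cigar,
                       :rnext, :pnext, :tlen, :seq, :qual, :tags) do
  # Flag bit 0x4: the read is unmapped.
  def unmapped?
    (flag & 0x4) != 0
  end

  # Flag bit 0x10: the read maps to the reverse strand.
  def reverse_strand?
    (flag & 0x10) != 0
  end
end

# Parses a single alignment line; returns nil for "@" header lines.
def parse_sam_line(line)
  return nil if line.start_with?('@')
  f = line.chomp.split("\t")
  raise ArgumentError, "expected at least 11 columns, got #{f.size}" if f.size < 11

  tags = {}
  f[11..-1].each do |t|
    tag, type, value = t.split(':', 3)
    # Integer-typed tags are converted; every other type stays a String.
    tags[tag] = (type == 'i' ? value.to_i : value)
  end

  SamRecord.new(f[0], f[1].to_i, f[2], f[3].to_i, f[4].to_i, f[5],
                f[6], f[7].to_i, f[8].to_i, f[9], f[10], tags)
end
```

Something along the lines of Bio::DB::Sam would still need BAM support and indexed region queries, which is where a binding to the samtools C library would presumably come in.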
From pjotr.public14 at thebird.nl Thu Nov 26 09:17:03 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 26 Nov 2009 15:17:03 +0100 Subject: [BioRuby] Bio::DB::Sam In-Reply-To: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com> References: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com> Message-ID: <20091126141703.GA21032@thebird.nl> On Thu, Nov 26, 2009 at 01:52:16PM +0000, Jan Aerts wrote: > And another parser that probably should be added to bioruby: something > to interact with SAM/BAM files (which contain mapping positions for > short reads). More info at samtools.sourceforge.net by the looks of it - it should be relatively easy with SWIG - and therefore Biolib. > Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should > go for something similar? > http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm Wow, this guy is hard core! Doing this with PerlXS takes a *lot* of effort. XS is sooooo nineties ;-) > Is anyone already working on this? I am happy to write a SWIG mapper. If someone really cares to use it and will write the higher-level Ruby interface (nice OOP class representation). I have been told Bioruby is pure Ruby - so this will not fit in. Pj. From biopython at maubp.freeserve.co.uk Thu Nov 26 11:02:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 16:02:50 +0000 Subject: [BioRuby] Fwd: [DAS] DAS workshop 7th-9th April 2010 In-Reply-To: References: Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com> This might be of interest to some of you. 
Peter

---------- Forwarded message ----------
From: Jonathan Warren
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: das at biodas.org, das_registry_announce at sanger.ac.uk, biojava-dev, BioJava, BioPerl, all at sanger.ac.uk, all at ebi.ac.uk, ensembldev

We are considering running a Distributed Annotation System workshop here at the Sanger/EBI in the UK subject to decent demand. The workshop will be held from Wednesday 7th-Friday 9th April 2010. If you would be interested in attending either to present or just take part then please email me jw12 at sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st day for beginners, 2nd for both beginners and advanced users, 3rd day for advanced), information for which can be found here: http://www.dasregistry.org/course.jsp

If you would like to present then please send a short summary of what you would like to talk about.

Thanks
Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
DAS mailing list
DAS at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das

From josejotero at gmail.com Fri Nov 27 21:55:38 2009
From: josejotero at gmail.com (Jose Otero)
Date: Fri, 27 Nov 2009 18:55:38 -0800
Subject: [BioRuby] Bio::GenBank
Message-ID:

Hello all,

I'm new to BioRuby and I am trying to adapt the BioGenbank class to store information of my plasmid database.

Question 1: Does anybody know how to insert a nucleic acid sequence as the value to 'sequence' in the @data object? Placing information in as Bio::Feature::Qualifier objects is easy, as is inserting Bio::Locus information. But I can't figure out how to insert the sequence data.
Question 2: Has anybody ever changed the data from a BioGenbank object and saved the altered file? This would be very interesting for my plasmid database.

Thanks for the help.
JO

From ngoto at gen-info.osaka-u.ac.jp Sat Nov 28 04:00:01 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 28 Nov 2009 18:00:01 +0900
Subject: [BioRuby] Bio::GenBank
In-Reply-To:
References:
Message-ID: <20091128090002.372041CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hello Jose,

On Fri, 27 Nov 2009 18:55:38 -0800 Jose Otero wrote:

> Hello all,
> I'm new to BioRuby and I am trying to adapt the BioGenbank class to store
> information of my plasmid database.
> Question 1: Does anybody know how to insert a nucleic acid sequence as the
> value to 'sequence' in the @data object?
> Placing information in as Bio::Feature::Qualifier objects is easy, as is
> inserting Bio::Locus information. But I can't figure out how to insert the
> sequence data.

Once an object of the Bio::GenBank class is created, the data stored in the object is intended to be read-only, though modification is not explicitly prohibited. This is because the class is designed for efficient parsing of GenBank formatted text, and it is technically not easy to achieve both efficient parsing and flexible modification. (This also applies to most parser classes, e.g. Bio::EMBL, Bio::SPTR, etc.)

In your case, using Bio::Sequence seems the best way. After converting a Bio::GenBank object to a Bio::Sequence object, it can be freely modified.

  # Assume str contains GenBank formatted text as String.
  #
  # Creating a new Bio::GenBank object.
  gb = Bio::GenBank.new(str)
  # Converting to Bio::Sequence object
  s = gb.to_biosequence
  # Modifying the sequence.
  #
  # Note that other attributes, such as features and references
  # (which depend on locations on the sequence) are kept unchanged.
  # Relocation of the features, references, etc. is left to the
  # user.
  #
  s.seq = 'atgc' * 10 + s.seq
  # Text formatting as the GenBank format.
  puts s.output(:genbank)

Creating a new Bio::Sequence object from scratch, giving definition, accessions, keywords, references, features, etc., and getting GenBank-formatted text can also be done.

> Question 2: Has anybody ever changed the data from a BioGenbank object and
> save the altered file? This would be very interesting for my plasmid
> database.

As described above, Bio::Sequence#output can be used. The method returns formatted text as String, and you can easily write it to a file.

> Thanks for the help.
> JO

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From juanfc at uma.es Tue Nov 3 22:44:10 2009
From: juanfc at uma.es (Juan Falgueras)
Date: Tue, 3 Nov 2009 23:44:10 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <87CAA48B-151F-41C3-9DF5-23C4B43BDFD0@uma.es>

Hi, have you tried it with Ruby 1.9?

On 03/11/2009, at 15:11, Yannick Wurm wrote:

> Hi,
>
> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
>
> $time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub
> (/^>/, '>MyPrefix')" > abc
> real 0m20.379s
> user 0m0.741s
> sys 0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e
> 's/^>/>MyPrefix/g' > ab
> real 0m2.165s
> user 0m0.266s
> sys 0m0.146s
>
>
> Is there any hope for ruby?
>
> Thanks,
> yannick
>
>
> --------------------------------------------
> yannick . wurm @ unil .
ch
> Ant Genomics, Ecology & Evolution @ Lausanne
> http://www.unil.ch/dee/page28685_fr.html
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From trevor at corevx.com Tue Nov 3 23:18:50 2009
From: trevor at corevx.com (Trevor Wennblom)
Date: Tue, 3 Nov 2009 17:18:50 -0600
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID:

On Nov 3, 2009, at 4:49 PM, Yannick Wurm wrote:

> I found some generic benchmarks indicating that ruby is generally
> only a bit slower than perl
> http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=yarv&lang2=perl&box=1

From robert.citek at gmail.com Wed Nov 4 01:32:12 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Tue, 3 Nov 2009 20:32:12 -0500
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <4145b6790911031732m731d0b09o199041ab0feb610c@mail.gmail.com>

On Tue, Nov 3, 2009 at 9:11 AM, Yannick Wurm wrote:

> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
> $time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/,
> '>MyPrefix')" > abc
>        real    0m20.379s
>        user    0m0.741s
>        sys     0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e
> 's/^>/>MyPrefix/g' > ab
>        real    0m2.165s
>        user    0m0.266s
>        sys     0m0.146s
>
>
> Is there any hope for ruby?
I get a factor of about three on a 10,000,000 line FASTA file: $ time -p yes ">foo"$'\n'"bar" | head -10000000 | ruby -pe "gsub(/^>/, '>MyPrefix')" > /dev/null real 42.99 user 43.39 sys 0.63 $ time -p yes ">foo"$'\n'"bar" | head -10000000 | perl -pe 's/^>/>MyPrefix/g' > /dev/null real 15.89 user 16.33 sys 0.26 This is with perl 5.8.8 and ruby 1.8.6 on a dual 1.6 GHz CPU with 512 MB RAM. Notice your user and system times are less than a factor of three. It's only the real time that is 10x, which suggests that ruby is waiting on other processes, e.g. disk reads. Regards, - Robert From pjotr.public14 at thebird.nl Wed Nov 4 10:22:45 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 4 Nov 2009 11:22:45 +0100 Subject: [BioRuby] Ruby speed In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> Message-ID: <20091104102245.GA13264@thebird.nl> On Tue, Nov 03, 2009 at 03:11:52PM +0100, Yannick Wurm wrote: > Is there any hope for ruby? I guess you mean this tongue in cheek. However, it is dangerous as it may turn off users looking to start with Ruby or Perl. So let me state I think there is plenty of hope for Ruby. You are talking execution speed of 'simple' oneliners. For complex programming Ruby outspeeds Perl, usually in practise. Particularly the speed of getting things done, but also a cleaner way of programming helps create better code. The end result will often be faster. And the third gain is in the code maintenance cycle. I am talking from experience here. I have written a lot of code in both languages (and Python too). Perl6 is getting interesting. The syntax is much cleaned up, proper OOP, and (what I like) strong functional programming support. But its execution speed is not even close to Ruby's now. I have heard people joke that Ruby is what Perl6 was meant to be. Anyway you can see where the Perl folks are heading. Pj. P.S. What is there to stop you from using both languages? 
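For the situation described earlier in this thread (a Ruby script shelling out to a one-liner roughly 10,000 times, once per prefix), one option is to do the substitution inside the interpreter that is already running, so that the startup cost is paid only once. A minimal sketch follows; the method name `prefix_fasta_ids` and its arguments are hypothetical, and whether it actually helps depends on where the time is really going.

```ruby
require 'stringio'

# Copies FASTA text from +input+ to +output+, inserting +prefix+ right
# after the ">" of every header line. Sequence lines pass through as-is.
# Both arguments can be any IO-like objects (File, StringIO, $stdin, ...).
def prefix_fasta_ids(input, output, prefix)
  input.each_line do |line|
    line = ">#{prefix}#{line[1..-1]}" if line.start_with?('>')
    output << line
  end
  output
end

# Small in-memory demonstration.
demo = prefix_fasta_ids(StringIO.new(">seq1\nACGT\n"), StringIO.new, 'MyPrefix')
puts demo.string
```

With real files, the same method could be called once per prefix from inside the larger script, for example by wrapping it in a pair of File.open blocks for the input and output paths.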
From mail at michaelbarton.me.uk Wed Nov 4 11:24:36 2009 From: mail at michaelbarton.me.uk (Michael Barton) Date: Wed, 4 Nov 2009 11:24:36 +0000 Subject: [BioRuby] Ruby speed In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch> References: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch> <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch> Message-ID: 2009/11/3 Yannick Wurm : > thanks for your response. I'm running: > ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux] > Starting to age, but on a production machine I'd rather stay with what works > than risk breaking things by upgrading them. I think Ruby 1.9 is now the official Ruby release, so you might want to start trying out using this version, for example Rails 3.0 won't work with Ruby 1.8.6 anymore. I've tried Ruby 1.9 a bit myself and the requirements for compatibility are relatively small. If you still prefer to use 1.8, you could try using REE (http://www.rubyenterpriseedition.com/) which has a few patches to improve performance over vanilla 1.8. You could try using ruby_switcher which makes trying different ruby versions a bit less painful - http://bit.ly/1kY1Qk > the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :) Could you just try calling out to sed then? > I have the feeling that it's ruby's startup-time especially. Running the > ruby one-liner my a fasta of 40,000 sequences takes 20 seconds; running it a > fasta of only 10 lines still takes 13 seconds!! You might also want to try experimenting with gsub! instead of gsub as the former does destructive in place substitution while the latter creates an extra object with the substituted text. This extra object creation might also slow performance. Cheers Mike From diapriid at gmail.com Wed Nov 4 18:29:13 2009 From: diapriid at gmail.com (Matt) Date: Wed, 4 Nov 2009 13:29:13 -0500 Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook? 
Message-ID: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com> Hi all, As far as I can tell there is yet no straightforward way to use Bio:Blast with the NCBI portal? I've seen this on the wiki: "Add remote BLAST search sites", and understand the basic concept, but don't have time at present to work on this. Is anyone actively working on this? (just FYI see http://github.com/kwicher/ruby-blast-at-ncbi). I ask in part because I'm struggling to get a basic remote blast working: seq = Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGTTTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA') blast_factory = 
Bio::Blast.new('blastn','nr-nt', '', 'genomenet') foo = blast_factory.query(seq) ... freezes, when I ctrl-C from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `call' from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `sleep' from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `exec_genomenet' from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `__send__' from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `query' from (irb):25 any glaring problems with this? Is it just waiting for the results of the remote query? I noticed that the genomenet blasts are much slower than NCBI in general (I'm in the US). thanks, Matt From diapriid at gmail.com Wed Nov 4 19:57:11 2009 From: diapriid at gmail.com (Matt) Date: Wed, 4 Nov 2009 14:57:11 -0500 Subject: [BioRuby] (previous answered in part) timeout/long time Message-ID: <19d6b9770911041157l1556ac89s4e8c62ad2e20460d@mail.gmail.com> Aha- my queries *are* working, just taking a very long time to finish. Can I limit to say top 10 results? cheers, Matt From yannick.wurm at unil.ch Wed Nov 4 19:56:13 2009 From: yannick.wurm at unil.ch (Yannick Wurm) Date: Wed, 4 Nov 2009 20:56:13 +0100 Subject: [BioRuby] Ruby speed Message-ID: <81E8B742-2508-40DF-8E81-07F1C8126839@unil.ch> > Notice your user and system times are less than a factor of three. > It's only the real time that is 10x, which suggests that ruby is > waiting on other processes, e.g. disk reads. Great point Robert - I hadn't seen that. My guess the difference is due to the fact that ruby is only installed in my networked (sfs) home dir on the linux server, not on the local machine like perl is. Gotta get the sysadmins to install ruby :) cheers! 
yannick

From email2ants at gmail.com Thu Nov 5 16:22:12 2009
From: email2ants at gmail.com (Anthony Underwood)
Date: Thu, 5 Nov 2009 16:22:12 +0000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <86C24368-84E1-4A43-ABBD-A26B998159B2@gmail.com>

Hi Matt

I have done a bit of work to get NCBI blast working within bioruby. See this gist on github http://gist.github.com/227160

ncbi_blast.rb defines an exec_ncbi class for the Blast class in bioruby

The script ncbi_blast_test.rb illustrates its usage but uses a few functions defined in the blast_functions.rb file

Essentially the following should work:

  require 'rubygems'
  require 'bio'
  require 'ncbi_blast'
  # use this if you are working from behind a proxy and enter ip and port number as appropriate
  ENV['http_proxy'] = "http://proxy_server_ip:port_number"
  sequence = "ATGAATCCAAATCAGAAAATAATAA........"
  factory = Bio::Blast.remote('blastn', 'nr', '', 'ncbi')
  blast_report = factory.query(sequence)

blast_report will be a Bio::Blast::Report object which can be parsed as described in the bioruby api

The hit definitions are fairly uninformative, containing just the accessions. This is why I then have to fetch the data from embl as follows:

  accession = definition.split("|")[3]
  accession.sub!(/\..+$/, "") # remove version number
  server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
  embl_text = server.fetch('embl', accession)
  embl_object = Bio::EMBL.new(embl_text)
  puts embl_object.description

This is still a work in progress but it worked OK for me. Hope it is of some use to you.

Anthony

On 4 Nov 2009, at 18:29, Matt wrote:

> Hi all,
>
> As far as I can tell there is yet no straightforward way to use
> Bio:Blast with the NCBI portal?
I've seen this on the wiki: "Add > remote BLAST search sites", and understand the basic concept, but > don't have time at present to work on this. Is anyone actively > working on this? (just FYI see > http://github.com/kwicher/ruby-blast-at-ncbi). > > I ask in part because I'm struggling to get a basic remote blast > working: > > seq = > Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGT! > TTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA') > > blast_factory = Bio::Blast.new('blastn','nr-nt', '', 'genomenet') > foo = blast_factory.query(seq) > > ... 
freezes, when I ctrl-C > > from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ > genomenet.rb:224:in > `call' > from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ > genomenet.rb:224:in > `sleep' > from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ > genomenet.rb:224:in > `exec_genomenet' > from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/ > blast.rb:368:in > `__send__' > from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/ > blast.rb:368:in > `query' > from (irb):25 > > any glaring problems with this? Is it just waiting for the results of > the remote query? I noticed that the genomenet blasts are much > slower than NCBI in general (I'm in the US). > > thanks, > Matt > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From kenglish at gmail.com Thu Nov 5 16:43:31 2009 From: kenglish at gmail.com (Kevin English) Date: Thu, 5 Nov 2009 06:43:31 -1000 Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook? In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com> References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com> Message-ID: Have you considered downloading the nr-nt databases and running local queries? I played with the Blast Remote for a while but determined it was too slow for our workload... Kevin From yannick.wurm at unil.ch Thu Nov 5 20:06:33 2009 From: yannick.wurm at unil.ch (Yannick Wurm) Date: Thu, 5 Nov 2009 21:06:33 +0100 Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 1 In-Reply-To: References: Message-ID: On 4 Nov 2009, at 18:00, bioruby-request at lists.open-bio.org wrote: > I guess you mean this tongue in cheek. However, it is dangerous as it > may turn off users looking to start with Ruby or Perl. So let me state > I think there is plenty of hope for Ruby. You are talking execution > speed of 'simple' oneliners. 
For complex programming Ruby outspeeds
> Perl, usually in practise. Particularly the speed of getting things
> done, but also a cleaner way of programming helps create better code.
> The end result will often be faster. And the third gain is in the code
> maintenance cycle. I am talking from experience here. I have written
> a lot of code in both languages (and Python too).

Those are excellent points, Pjotr.

> Perl6 is getting interesting. The syntax is much cleaned up, proper
> OOP, and (what I like) strong functional programming support. But its
> execution speed is not even close to Ruby's now. I have heard people
> joke that Ruby is what Perl6 was meant to be.
>
> Anyway you can see where the Perl folks are heading.

Yes, Damian Conway of Perl Best Practices gave us a small workshop recently, and I couldn't help but think that Perl6 was an attempt to rubify perl :)

> P.S. What is there to stop you from using both languages?

Nothing official. But I already find it difficult to keep the R, bash and ruby parts of my brain optimized without mixing in perl and others :)

Cheers,
yannick

From rob.syme at gmail.com Fri Nov 6 02:55:45 2009
From: rob.syme at gmail.com (Rob Syme)
Date: Fri, 6 Nov 2009 10:55:45 +0800
Subject: [BioRuby] Parsing large blastout.xml files
Message-ID:

I'm trying to extract information from a large blast xml file. To parse the xml file, ruby reads the whole file into memory before looking at each entry. For large files (2.5GBish) - the memory requirements become severe.

My first approach was to split each query up into its own xml instance, so that

[XML example stripped by the archive]

Would end up looking more like:

[XML example stripped by the archive]

Which bioruby has trouble parsing, so the [tag stripped by the archive]s had to be given their own file:

$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'

Now each file can be parsed individually. I feel like there has to be an easier way. Is there a way to parse large xml files without huge memory overheads, or is that just par for the course?
From rozziite at gmail.com Fri Nov 6 03:11:32 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Thu, 5 Nov 2009 22:11:32 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To:
References:
Message-ID: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>

Another option is to use the ruby-libxml reader.
http://libxml.rubyforge.org/rdoc/index.html
It reads the data sequentially, so there is no memory overhead from first reading it all into memory. However, you would then have to parse it from scratch.

On that note, maybe it is worth implementing Bio::Blast::Report.libxml or something like that, the same way there is Bio::Blast::Report.rexml and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was introduced into the BioRuby library with the PhyloXML parser.

Diana

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own xml
> instance, so that
>
> Would end up looking more like:
>
> Which bioruby has trouble parsing, so the s had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?
> _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From adamnkraut at gmail.com Fri Nov 6 03:17:02 2009 From: adamnkraut at gmail.com (Adam) Date: Thu, 5 Nov 2009 22:17:02 -0500 Subject: [BioRuby] Parsing large blastout.xml files In-Reply-To: References: Message-ID: <134ede0b0911051917nf0877e0y8df95c3147a24d07@mail.gmail.com> You might want to try a SAX Parser instead. REXML from the standard library has a streaming API. LibXML is a lot faster and it's available as a gem. http://libxml.rubyforge.org/ On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote: > I'm trying to extract information from a large blast xml file. To parse the > xml file, ruby reads the whole file into memory before looking at each > entry. For large files (2.5GBish) - the memory requirements become severe. > > My first approach was to split each query up into its own xml > instance, so that > > > > > > > > > > > > > > > > > > > > > > > > > > > > Would end up looking more like: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Which bioruby has trouble parsing, so the s had to be given > their own file: > > $ csplit --prefix="tinyxml_"segmented_blastout.xml '/<\?xml/' '{*}' > > Now each file can be parsed individually. I feel like there has to be an > easier way. Is there a way to parse large xml files without huge memory > overheads, or is that just par for the course? 
> _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From pjotr.public14 at thebird.nl Fri Nov 6 08:58:15 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 6 Nov 2009 09:58:15 +0100 Subject: [BioRuby] Parsing large blastout.xml files In-Reply-To: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com> References: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com> Message-ID: <20091106085815.GA12244@thebird.nl> Diana is right. We need to revamp the implementation for big results. Not only that, the current implementation has method names do not match the BLAST names. I need something like this pretty soon and was thinking of writing it. Pj. On Thu, Nov 05, 2009 at 10:11:32PM -0500, Diana Jaunzeikare wrote: > Another option is to use ruby-libxml reader. > http://libxml.rubyforge.org/rdoc/index.html It reads the data > sequentially thus there is no memory overhead of first reading it all > in memory. However, then you would have to parse it from scratch. > > On that note, maybe it is worth implementing Bio::Blast::Report.libxml > or something like that the same way there is Bio::Blast::Report.rexml > and Bio::Blast::Report.xmlparser. Dependecy to ruby-libxml in BioRuby > library was introducted in PhyloXML parser. > > Diana > > On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme wrote: > > I'm trying to extract information from a large blast xml file. To parse the > > xml file, ruby reads the whole file into memory before looking at each > > entry. For large files (2.5GBish) - the memory requirements become severe. > > > > My first approach was to split each query up into its own xml > > instance, so that > > > > > > ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? 
> > > > > > Would end up looking more like: > > > > ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? > > > > > > > > ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? > > > > > > > > ? > > ? ? > > ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? ? > > ? ? ? > > ? ? > > ? > > > > > > Which bioruby has trouble parsing, so the s had to be given > > their own file: > > > > $ csplit --prefix="tinyxml_"segmented_blastout.xml '/<\?xml/' '{*}' > > > > Now each file can be parsed individually. I feel like there has to be an > > easier way. Is there a way to parse large xml files without huge memory > > overheads, or is that just par for the course? > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr.public14 at thebird.nl Sat Nov 7 07:42:44 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 7 Nov 2009 08:42:44 +0100 Subject: [BioRuby] Parsing large blastout.xml files In-Reply-To: References: Message-ID: <20091107074244.GA22748@thebird.nl> I did the same a while back using xmltwig: http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml On Fri, Nov 06, 2009 at 10:55:45AM +0800, Rob Syme wrote: > I'm trying to extract information from a large blast xml file. To parse the > xml file, ruby reads the whole file into memory before looking at each > entry. For large files (2.5GBish) - the memory requirements become severe. 
>
> My first approach was to split each query up into its own xml
> instance, so that
>
> Would end up looking more like:
>
> Which bioruby has trouble parsing, so the s had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From djaunzei at smith.edu Sun Nov 8 03:50:26 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Sat, 7 Nov 2009 22:50:26 -0500
Subject: [BioRuby] BioRuby Phyloxml update
Message-ID: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>

Hi all,

So finally I have updated the Bio::Tree and Bio::Node classes to improve the phyloxml writer speed.

* Added Bio::Node::parent and Bio::Node::children (array of nodes) in order to avoid calling Tree::parent(node) or Tree::children(node), because those methods run a breadth-first search on the underlying graph, which makes the PhyloXML writer and parser incredibly slow. In contrast, Bio::Node::parent and Bio::Node::children keep references to the respective nodes.
* Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge, Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep track of Node::parent and Node::children nodes correctly. Have I forgotten anything?
* The PhyloXML writer now takes less than 1 second instead of ~20 minutes to write the 1.5MB ncbi_taxonomy_mollusca.xml file.
* Writing the tree of life taxonomy file (~46MB) takes 10 seconds (on 2.4GHz, 2.9GB RAM, running Ubuntu).

The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class

I wrote unit tests for my changes and made sure my changes don't break anything else. However, does anybody have code lying around that uses the Tree::parent and Tree::children methods so that I can test it more thoroughly?

Cheers,
Diana

From ngoto at gen-info.osaka-u.ac.jp Sun Nov 8 12:50:56 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 08 Nov 2009 21:50:56 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
Message-ID: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Diana,

I'm sorry that the changes cannot be accepted, because the modification of existing Bio::Tree methods breaks things. Bio::Tree does not want to have children/parent information in nodes. One of the reasons is that it is difficult to keep consistency when copying a tree. Nodes can be shared between two or more trees when copying a tree with the "dup" or "clone" method.

Normally, tests for existing classes should not be modified except when the specification changes or the test itself has a bug, because they guarantee the specification of the class. Adding new tests is OK.

If you really want nodes to have parent/children information in each node, please do so only in the PhyloXML classes (though I'm negative about it). In this case, the problem is that reading phyloxml data and writing it back again seems fine, but there seems to be currently no way to convert Bio::Tree to PhyloXML. As it stands, it seems hard to convert Newick data to PhyloXML.

Now, to prepare to include your PhyloXML code in BioRuby, I'm working on my branch. Some API changes will be made.
http://github.com/ngoto/bioruby/tree/incoming Note that in your test code, argument order of assert_equal is wrong. I've already fixed in my branch. http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94 > * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge, > Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep > track of Node::parent and Node::children nodes correctly. Have I > forgotten anything? Changing root with tree.root=(). -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > Hi all, > > So finally I have updated Bio::Tree and Bio::Node classes to improve > the phyloxml writer speed. > > * Added Bio::Node::parent and Bio::Node::children (array of nodes) in > order to avoid calling Tree::parent(node) or Tree::children(node), > because those methods call breath first search on the underlying > graph, which makes PhyloXML writer and parser incredibly slow. In > contrast, Bio::Node::parent and Bio::Node::children keeps references > to the respective nodes. > * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge, > Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep > track of Node::parent and Node::children nodes correctly. Have I > forgotten anything? > * Now for PhyloXML writer it takes less than 1 second instead of > ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB > * To write the tree of life taxonomy file (~46MB) it takes 10 seconds > (On 2.4GHz, 2.9GB RAM, running Ubuntu) > > The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class > > I wrote unit tests for my changes and made sure my changes don't break > anything else. However, does anybody has code laying around that uses > Tree::parent and Tree::children methods so that I can test it more > thoroughly? 
> > Cheers, > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From jan.aerts at gmail.com Mon Nov 16 10:11:24 2009 From: jan.aerts at gmail.com (Jan Aerts) Date: Mon, 16 Nov 2009 10:11:24 +0000 Subject: [BioRuby] BioRuby Phyloxml update In-Reply-To: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp> References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com> <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com> All, I think we should make a good effort of merging Diana's code into the bioruby codebase. Even though I'm not completely familiar with bioruby's phylo implementation, an effort like hers should be welcomed with open arms. If her code speeds things up so immensely, why don't we start a new branch that will lead to bioruby 2.0? Let bioruby 2.0 break things. With a major new release things are allowed to be broken free from the legacy code. We definitely don't want Diana's efforts be in vain. jan. 2009/11/8 Naohisa Goto : > Hi Diana, > > I'm sorry that the changes cannot be accepted, because the > modification of existing Bio::Tree methods breaks things. > Bio::Tree does not want to have children/parent information > in nodes. One of the reasons is that it is difficult to keep > consistency when copying a tree. Nodes can be shared with two > or more trees when copying a tree by using "dup" or "clone" > method. > > Normally, tests for existing classes shold not be modified > except when changing specification or the test's bug, because > they guarantee specification of the class. Adding new tests > are OK. > > If you really want nodes to have parent/children information > in each node, please do so in only PhyloXML classes (though > I'm negative). 
?In this case, the problem is that reading phyloxml > data and write back again seems good, but it seems there are > currently no way to convert Bio::Tree to PhyloXML. Now, it seems > hard to convert Newick data to PhyloXML. > > Now, to prepare to include your PhyloXML code in BioRuby, I'm working > on my branch. Some API changes will be made. > http://github.com/ngoto/bioruby/tree/incoming > > Note that in your test code, argument order of assert_equal is wrong. > I've already fixed in my branch. > http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94 > >> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, >> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep >> track of Node::parent and Node::children nodes correctly. ?Have I >> forgotten anything? > > Changing root with tree.root=(). > > -- > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > >> Hi all, >> >> So finally I have updated Bio::Tree and Bio::Node classes to improve >> the phyloxml writer speed. >> >> * Added Bio::Node::parent and ?Bio::Node::children (array of nodes) in >> order to avoid calling Tree::parent(node) or Tree::children(node), >> because those methods call breath first search on the underlying >> graph, which makes PhyloXML writer and parser incredibly slow. In >> contrast, Bio::Node::parent and Bio::Node::children keeps references >> to the respective nodes. >> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, >> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep >> track of Node::parent and Node::children nodes correctly. ?Have I >> forgotten anything? 
>> * Now for PhyloXML writer it takes less than 1 second instead of
>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
>> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
>>
>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>
>> I wrote unit tests for my changes and made sure my changes don't break
>> anything else. However, does anybody has code laying around that uses
>> Tree::parent and Tree::children methods so that I can test it more
>> thoroughly?
>>
>> Cheers,
>> Diana
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From georgkam at gmail.com Tue Nov 17 05:40:31 2009
From: georgkam at gmail.com (George Githinji)
Date: Tue, 17 Nov 2009 08:40:31 +0300
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 6
In-Reply-To:
References:
Message-ID: <55915f820911162140w592077f4o448d63e11b4300be@mail.gmail.com>

If Ruby itself is known to be slow compared to other interpreters, and Diana's code speeds things up, then as a BioRuby user I would plead with the developers to adopt her code, with its speed optimizations, in the next release. The next release can only be better if the current code base is overhauled and reviewed in light of new developments like Diana's.

If Newick can be converted to a format which can then be converted to PhyloXML, then conversion from Newick is not a problem. Otherwise I would question the use of the Newick format if it cannot be inter-converted with other file formats.
On Mon, Nov 16, 2009 at 8:00 PM, wrote: > Send BioRuby mailing list submissions to > bioruby at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/bioruby > or, via email, send a message with subject or body 'help' to > bioruby-request at lists.open-bio.org > > You can reach the person managing the list at > bioruby-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of BioRuby digest..." > > > Today's Topics: > > 1. Re: BioRuby Phyloxml update (Jan Aerts) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 16 Nov 2009 10:11:24 +0000 > From: Jan Aerts > Subject: Re: [BioRuby] BioRuby Phyloxml update > To: Naohisa Goto > Cc: phyloxml at yahoogroups.com, Pjotr Prins , > bioruby at lists.open-bio.org, Diana Jaunzeikare > Message-ID: > <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > All, > > I think we should make a good effort of merging Diana's code into the > bioruby codebase. Even though I'm not completely familiar with > bioruby's phylo implementation, an effort like hers should be welcomed > with open arms. > > If her code speeds things up so immensely, why don't we start a new > branch that will lead to bioruby 2.0? Let bioruby 2.0 break things. > With a major new release things are allowed to be broken free from the > legacy code. > > We definitely don't want Diana's efforts be in vain. > > jan. > > 2009/11/8 Naohisa Goto : > > Hi Diana, > > > > I'm sorry that the changes cannot be accepted, because the > > modification of existing Bio::Tree methods breaks things. > > Bio::Tree does not want to have children/parent information > > in nodes. One of the reasons is that it is difficult to keep > > consistency when copying a tree. 
Nodes can be shared with two > > or more trees when copying a tree by using "dup" or "clone" > > method. > > > > Normally, tests for existing classes shold not be modified > > except when changing specification or the test's bug, because > > they guarantee specification of the class. Adding new tests > > are OK. > > > > If you really want nodes to have parent/children information > > in each node, please do so in only PhyloXML classes (though > > I'm negative). ?In this case, the problem is that reading phyloxml > > data and write back again seems good, but it seems there are > > currently no way to convert Bio::Tree to PhyloXML. Now, it seems > > hard to convert Newick data to PhyloXML. > > > > Now, to prepare to include your PhyloXML code in BioRuby, I'm working > > on my branch. Some API changes will be made. > > http://github.com/ngoto/bioruby/tree/incoming > > > > Note that in your test code, argument order of assert_equal is wrong. > > I've already fixed in my branch. > > > http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94 > > > >> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, > >> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep > >> track of Node::parent and Node::children nodes correctly. ?Have I > >> forgotten anything? > > > > Changing root with tree.root=(). > > > > -- > > Naohisa Goto > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > > >> Hi all, > >> > >> So finally I have updated Bio::Tree and Bio::Node classes to improve > >> the phyloxml writer speed. > >> > >> * Added Bio::Node::parent and ?Bio::Node::children (array of nodes) in > >> order to avoid calling Tree::parent(node) or Tree::children(node), > >> because those methods call breath first search on the underlying > >> graph, which makes PhyloXML writer and parser incredibly slow. In > >> contrast, Bio::Node::parent and Bio::Node::children keeps references > >> to the respective nodes. 
> >> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, > >> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep > >> track of Node::parent and Node::children nodes correctly. ?Have I > >> forgotten anything? > >> * Now for PhyloXML writer it takes less than 1 second instead of > >> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB > >> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds > >> (On 2.4GHz, 2.9GB RAM, running Ubuntu) > >> > >> The code is in > http://github.com/latvianlinuxgirl/bioruby/tree/tree_class > >> > >> I wrote unit tests for my changes and made sure my changes don't break > >> anything else. However, does anybody has code laying around that uses > >> Tree::parent and Tree::children methods so that I can test it more > >> thoroughly? > >> > >> Cheers, > >> Diana > >> _______________________________________________ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > ------------------------------ > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > End of BioRuby Digest, Vol 50, Issue 6 > ************************************** > -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ From djaunzei at smith.edu Tue Nov 17 14:52:59 2009 From: djaunzei at smith.edu (Diana Jaunzeikare) Date: Tue, 17 Nov 2009 09:52:59 -0500 Subject: [BioRuby] BioRuby Phyloxml update In-Reply-To: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com> References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com> <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp> 
<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com> Message-ID: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com> Thanks for discussion. I see Naohisa's point that it is difficult to keep consistency when copying a tree. Right now PhyloXML class inherits from Bio::Tree class. Instead, I could write a new general Bio::FamilyTree class (per Pjotr's suggestion), which would be strictly a tree (I believe that Bio::Tree allows for a node to have 2 parents) and would have parent/child information. Thus it would not need underlying general graph implementation, therefore making the implementation simpler than that of Bio::Tree. Then PhyloXML::Tree would inherit from Bio::FamilyTree. This way PhyloXML writer probably would be even faster because it would not need to update Bio::Pathway structure (which is under Bio::Tree) every time adding a node or edge. Additionally, I think BioRuby would benefit from general Bio::FamilyTree class. I recently heard a talk by researcher who did phylogenetic analysis of musical rhythms. Also I will write method to convert from newick to PhyloXML. What do you think? Cheers, Diana On Mon, Nov 16, 2009 at 5:11 AM, Jan Aerts wrote: > All, > > I think we should make a good effort of merging Diana's code into the > bioruby codebase. Even though I'm not completely familiar with > bioruby's phylo implementation, an effort like hers should be welcomed > with open arms. > > If her code speeds things up so immensely, why don't we start a new > branch that will lead to bioruby 2.0? Let bioruby 2.0 break things. > With a major new release things are allowed to be broken free from the > legacy code. > > We definitely don't want Diana's efforts be in vain. > > jan. > > 2009/11/8 Naohisa Goto : >> Hi Diana, >> >> I'm sorry that the changes cannot be accepted, because the >> modification of existing Bio::Tree methods breaks things. >> Bio::Tree does not want to have children/parent information >> in nodes. 
One of the reasons is that it is difficult to keep >> consistency when copying a tree. Nodes can be shared with two >> or more trees when copying a tree by using "dup" or "clone" >> method. >> >> Normally, tests for existing classes shold not be modified >> except when changing specification or the test's bug, because >> they guarantee specification of the class. Adding new tests >> are OK. >> >> If you really want nodes to have parent/children information >> in each node, please do so in only PhyloXML classes (though >> I'm negative). ?In this case, the problem is that reading phyloxml >> data and write back again seems good, but it seems there are >> currently no way to convert Bio::Tree to PhyloXML. Now, it seems >> hard to convert Newick data to PhyloXML. >> >> Now, to prepare to include your PhyloXML code in BioRuby, I'm working >> on my branch. Some API changes will be made. >> http://github.com/ngoto/bioruby/tree/incoming >> >> Note that in your test code, argument order of assert_equal is wrong. >> I've already fixed in my branch. >> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94 >> >>> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, >>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep >>> track of Node::parent and Node::children nodes correctly. ?Have I >>> forgotten anything? >> >> Changing root with tree.root=(). >> >> -- >> Naohisa Goto >> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org >> >> >>> Hi all, >>> >>> So finally I have updated Bio::Tree and Bio::Node classes to improve >>> the phyloxml writer speed. >>> >>> * Added Bio::Node::parent and ?Bio::Node::children (array of nodes) in >>> order to avoid calling Tree::parent(node) or Tree::children(node), >>> because those methods call breath first search on the underlying >>> graph, which makes PhyloXML writer and parser incredibly slow. 
In >>> contrast, Bio::Node::parent and Bio::Node::children keeps references >>> to the respective nodes. >>> * Updated ?Tree::add_edge, Tree::clear_edge, Tree::remove_edge, >>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep >>> track of Node::parent and Node::children nodes correctly. ?Have I >>> forgotten anything? >>> * Now for PhyloXML writer it takes less than 1 second instead of >>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB >>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds >>> (On 2.4GHz, 2.9GB RAM, running Ubuntu) >>> >>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class >>> >>> I wrote unit tests for my changes and made sure my changes don't break >>> anything else. However, does anybody has code laying around that uses >>> Tree::parent and Tree::children methods so that I can test it more >>> thoroughly? >>> >>> Cheers, >>> Diana >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From ngoto at gen-info.osaka-u.ac.jp Tue Nov 17 16:27:46 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 18 Nov 2009 01:27:46 +0900 Subject: [BioRuby] BioRuby Phyloxml update In-Reply-To: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com> References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com> <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp> <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com> <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com> 
Message-ID: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I've just committed a speed-up of Bio::Tree#children in my repository. It keeps compatibility. As a trade-off for the speed-up, memory consumption is a little larger than with the previous code.
http://github.com/ngoto/bioruby

For benchmarking the reading and writing of big PhyloXML data, a new sample script based on Diana's test_phyloxml_big.rb has been added as sample/test_phyloxml_big.rb.

Running the new sample/test_phyloxml_big.rb on a machine (Pentium D 3.40GHz, memory 4GB, running Debian GNU/Linux):

with http://github.com/ngoto/bioruby:
47.52user 0.93system 0:50.09elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+167550minor)pagefaults 0swaps

with http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
43.55user 1.00system 0:46.59elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+165151minor)pagefaults 0swaps

Although my new code is still ~10% slower than Diana's new code, I think that is acceptable because my code keeps compatibility.

I wrote Bio::Tree because I wanted to manipulate trees flexibly, e.g. merging and splitting trees, or changing the root of a tree. For that purpose, I chose not to store parent/children in a node.

I also think the current Bio::Tree is not the best. One of its weak points is that it is relatively heavy. The flexibility may not be needed for parsers that only represent a fixed data structure. A new class seems attractive for usages that cannot be covered by the current Bio::Tree implementation.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 17 Nov 2009 09:52:59 -0500 Diana Jaunzeikare wrote:
> Thanks for discussion. I see Naohisa's point that it is difficult to
> keep consistency when copying a tree.
>
> Right now PhyloXML class inherits from Bio::Tree class.
Instead, I
> could write a new general Bio::FamilyTree class (per Pjotr's
> suggestion), which would be strictly a tree (I believe that Bio::Tree
> allows for a node to have 2 parents) and would have parent/child
> information. Thus it would not need underlying general graph
> implementation, therefore making the implementation simpler than that
> of Bio::Tree. Then PhyloXML::Tree would inherit from Bio::FamilyTree.
> This way PhyloXML writer probably would be even faster because it
> would not need to update Bio::Pathway structure (which is under
> Bio::Tree) every time adding a node or edge.
> Additionally, I think BioRuby would benefit from general
> Bio::FamilyTree class. I recently heard a talk by researcher who did
> phylogenetic analysis of musical rhythms.
>
> Also I will write method to convert from newick to PhyloXML.
>
> What do you think?
>
> Cheers,
> Diana

From tomoakin at kenroku.kanazawa-u.ac.jp Wed Nov 18 00:24:34 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 18 Nov 2009 09:24:34 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

Hi,

One point seems to be that a tree can be unrooted or rooted. Perhaps Goto-san's Bio::Tree represents an unrooted tree (not distinguishing parents and children), while Diana's class is for rooted trees (distinguishing parents from children). If this is the point, Bio::RootedTree is a better name than Bio::FamilyTree.

In general, a rooted tree should be easy to convert to an unrooted tree, while converting an unrooted tree to a rooted tree requires specifying the root.
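
To make the rooted/unrooted distinction concrete, here is a small plain-Ruby sketch. It does not use Bio::Tree or any BioRuby class; the toy edge list and the helper names `adjacency`, `root_at` and `unroot` are invented for this illustration. An unrooted tree is kept as an undirected adjacency list, rooting it takes one breadth-first pass from a chosen root and records each node's parent, and unrooting just drops the direction again.

```ruby
# Unrooted tree as a list of undirected edges (hypothetical toy data).
EDGES = [%w[a x], %w[b x], %w[x y], %w[c y], %w[d y]].freeze

# Builds an undirected adjacency list: every edge is stored in both directions.
def adjacency(edges)
  adj = Hash.new { |h, k| h[k] = [] }
  edges.each do |u, v|
    adj[u] << v
    adj[v] << u
  end
  adj
end

# Rooting: a single breadth-first pass from the chosen root records the
# parent of every node. The root must be given; the edges alone do not
# determine it.
def root_at(adj, root)
  parent = { root => nil }
  queue = [root]
  until queue.empty?
    node = queue.shift
    adj[node].each do |neighbour|
      next if parent.key?(neighbour) # already reached from the root side
      parent[neighbour] = node
      queue << neighbour
    end
  end
  parent
end

# Unrooting: forget which end of each edge is the parent.
def unroot(parent)
  parent.reject { |_, p| p.nil? }
        .map { |child, p| [child, p].sort }
        .sort
end

adj = adjacency(EDGES)
rooted_x = root_at(adj, 'x')
rooted_y = root_at(adj, 'y')

puts rooted_x['y'].inspect # parent of y when the tree is rooted at x
puts rooted_y['x'].inspect # parent of x when the tree is rooted at y
puts unroot(rooted_x) == unroot(rooted_y) # both rootings share one unrooted tree
```

Once the parent map exists, a parent lookup is a single hash access, which is the kind of saving the cached parent/children references discussed earlier in the thread are after; the price is that the map has to be rebuilt or updated whenever the root or the edges change.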
For a text representation like NEWICK, there is a root anyway, while
the tree can be interpreted as either rooted or unrooted.

It could be good to have distinct interfaces for rooted and unrooted
trees, to make users aware of the conceptual difference.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From tomoakin at kenroku.kanazawa-u.ac.jp  Thu Nov 19 00:33:32 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 19 Nov 2009 09:33:32 +0900
Subject: [BioRuby] Blast to Phylogeny
In-Reply-To: <4B045622.8040204@broadinstitute.org>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
	<4B045622.8040204@broadinstitute.org>
Message-ID:

Hi,

In general, to construct a phylogenetic tree from molecular sequence
data, you will collect the homologous sequences, perform multiple
alignment, identify the region that will be used for the
reconstruction, and then pass the data to an appropriate program to
reconstruct the phylogeny.

If I have a BLAST output, I would parse that file with Bio::FlatFile
and extract the identifiers of the hit sequences, use the identifiers
to collect the individual sequences, and submit the sequences to mafft
for multiple alignment. I would then convert the alignment to nexus
format, check it manually with MacClade, and parse the edited nexus
file to write the multiple alignment in a form readable by the
phylogenetic analysis program.

There are many options you can take at each step.
So, there are multiple ways, but not a single simple way. :(

Bioruby has support for multiple alignment programs like mafft,
muscle, and clustalw.
For phylogenetic reconstruction, there is some support for phylip and
paml (I have not tried these features of the Bioruby library, though).

There are a number of programs for phylogenetic analysis other than
phylip and paml. A list compiled by J. Felsenstein is available at
http://evolution.genetics.washington.edu/phylip/software.html

An alignment in a format similar to that of phylip will be accepted by
most programs.

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2009/11/19, at 5:16, Sharvari Gujja wrote:

> Hi,
>
> I am trying to construct a phylogenetic tree from Blast
> output...Could you please let me know if there is a way to do
> this..I have also been looking at Bio::Tree documentation but it is
> not clear if it accepts Blast file as input.
>
> Appreciate any help.
>
> Thanks
> Sharvari

From robert.citek at gmail.com  Thu Nov 19 20:06:22 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Thu, 19 Nov 2009 15:06:22 -0500
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <4145b6790911191206r53c86818m280e3a149f9293ec@mail.gmail.com>

Hello all,

I would like to create a custom BLAST scoring matrix that I can use
with NCBI's blastall. For example, let's say I want to create a
modified BLOSUM62 matrix called BLOSUM62ar, where the A:R score is now
2 instead of -1.

Some questions that I have:

1) is this possible?
2) if it is, where can I find documentation which describes how to do this?
3) is the blast output different from a regular blast?
4) if it is different, does bio-ruby have blast parsers that can parse
the output?

Thanks in advance for any pointers and suggestions.
Regards,
- Robert

From georgkam at gmail.com  Sat Nov 21 08:58:53 2009
From: georgkam at gmail.com (George Githinji)
Date: Sat, 21 Nov 2009 11:58:53 +0300
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>

Hi Martin,

Thanks for bringing the topic to the list. Some time back I was also
very interested in custom matrices for NCBI BLAST.
Making custom matrices is possible. Check this out:
BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

However, making your matrices work with NCBI BLAST is slightly
difficult, as you need to recompile the BLAST program and incorporate
your modifications. I found this not so straightforward, for lack of
good documentation.

I wonder whether there is someone who has implemented the BLAST
algorithm in Ruby. (The argument is usually that the C implementation
is very optimized and good, so why would one want to implement it in
Ruby? I would not buy that argument for learning purposes, though.)
The closest I came to a BLAST algorithm is an implementation of it in
Perl, in the book Genomic Perl by Rex A. Dwyer. He also outlines how
to create your own matrices, with code listings in Perl.

Please ping me back if you get more resources. :)

George

On Fri, Nov 20, 2009 at 8:00 PM, wrote:

> Send BioRuby mailing list submissions to
>        bioruby at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://lists.open-bio.org/mailman/listinfo/bioruby
> or, via email, send a message with subject or body 'help' to
>        bioruby-request at lists.open-bio.org
>
> You can reach the person managing the list at
>        bioruby-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of BioRuby digest..."
>
>
> Today's Topics:
>
>   1.
>      custom blast scoring matrix (Robert Citek)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Nov 2009 15:06:22 -0500
> From: Robert Citek
> Subject: [BioRuby] custom blast scoring matrix
> To: bioruby
> Message-ID:
>        <4145b6790911191206r53c86818m280e3a149f9293ec at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> Hello all,
>
> I would like to create a custom BLAST scoring matrix that I can use
> with NCBI's blastall. For example, let's say I want to create a
> modified BLOSUM62 matrix called BLOSUM62ar, where the A:R score is now
> 2 instead of -1.
>
> Some questions that I have:
>
> 1) is this possible?
> 2) if it is, where can I find documentation which describes how to do this?
> 3) is the blast output different from a regular blast?
> 4) if it is different, does bio-ruby have blast parsers that can parse
> the output?
>
> Thanks in advance for any pointers and suggestions.
>
> Regards,
> - Robert
>
>
> ------------------------------
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
> End of BioRuby Digest, Vol 50, Issue 10
> ***************************************
>

--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From robert.citek at gmail.com  Sun Nov 22 13:55:58 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Sun, 22 Nov 2009 08:55:58 -0500
Subject: [BioRuby] custom blast scoring matrix
In-Reply-To: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
References: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
Message-ID: <4145b6790911220555q410187fak9f8b1b66e4a0ddf2@mail.gmail.com>

On Sat, Nov 21, 2009 at 3:58 AM, George Githinji wrote:
> Thanks for bringing the topic on list. Sometimes back i was also very
> interested in custom matrices for NCBI blast.
> Making custom Matrices is possible.
> check this out
> BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

Thanks for the citation. I'll have a look into that.

> However making your matrices work with NCBI blast is slightly difficult as
> you need to recompile the BLAST program and incorporate your modifications. I
> found this a little bit not so straightforward. Lack of good documentation.

That's unfortunate. I've tried compiling NCBI blast a few times in
the past and don't ever recall having success with it, running into
the same issues you describe. But it's been a while and maybe the
process has become easier. I'll give it a whirl.

> I wonder whether there is someone who has implemented the BLAST algorithm in
> Ruby. (The argument is usually that the C implementation is very optimized
> and good, so why would one want to implement it in ruby?) though i would not
> buy that argument for learning purposes. The closest i came to a BLAST
> algorithm is an implementation of it in Perl, in the book Genomic Perl by
> Rex A. Dwyer, He also outlines how to create your own matrices with code
> listings in perl.

Thanks. I'll have a look at that as well.

> Please ping me back if you get more resources. :)

Will do.

Regards,
- Robert

From pjotr.public14 at thebird.nl  Thu Nov 26 13:08:30 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:08:30 +0100
Subject: [BioRuby] Ruby EMBOSS mapping (using Biolib)
Message-ID: <20091126130830.GA19003@thebird.nl>

Hi all,

Over the last year I have been working on C library mappings to Ruby.
A comparison of Bioruby against Biolib/EMBOSS on six-frame translation
of a C. elegans dataset shows that the Ruby with EMBOSS version is
about 30x faster. On my (outdated) machine:

Bioruby version:

  22929 records 137574 times translated!
  real    9m30.952s
  user    8m42.877s
  sys     0m32.878s

Biolib version:

  22929 records 137574 times translated!
  real    0m20.306s
  user    0m15.997s
  sys     0m1.344s

This includes IO, which is handled by Ruby.
The Bioruby code reads:

  # iter is not defined in this excerpt; it is set elsewhere in the
  # benchmark script.
  nt = FastaReader.new(fn)
  nt.each { | rec |
    seq = Bio::Sequence::NA.new(rec.seq)
    [-3,-2,-1,1,2,3].each do | frame |
      print "> ",rec.id," ",frame.to_s,"\n"
      print seq.translate(frame),"\n"
    end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

The Biolib code reads:

  nt = FastaReader.new(fn)
  trnTable = Biolib::Emboss.ajTrnNewI(1);
  nt.each { | rec |
    ajpseq = Biolib::Emboss.ajSeqNewNameC(rec.seq,"Test sequence")
    [-3,-2,-1,1,2,3].each do | frame |
      ajpseqt = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
      aa = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
      print "> ",rec.id," ",frame.to_s,"\n"
      print aa,"\n"
    end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

A write-up of the mapping effort is at:

  http://biolib.open-bio.org/wiki/Mapping_EMBOSS

From pjotr.public14 at thebird.nl  Thu Nov 26 13:44:27 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:44:27 +0100
Subject: [BioRuby] Announcing BigBio project for Ruby
Message-ID: <20091126134427.GA20660@thebird.nl>

BigBio = BIG DATA computing (for Ruby)

BigBio is an initiative to create high-performance libraries for big
data computing in biology - initially for the Ruby language.

The Ruby version of BigBio uses BioRuby, when sensible, but provides an
interface with a different design. Also, unlike BioRuby, which aims to
be pure Ruby, it uses BioLib C/C++ functions for increased performance
and reduced memory consumption.

The first module is an (indexed) FastaReader which does not load the
full FASTA file in memory.

  http://github.com/pjotrp/bigbio

Pj.

From jan.aerts at gmail.com  Thu Nov 26 13:44:58 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:44:58 +0000
Subject: [BioRuby] VCF
Message-ID: <4c7507a70911260544j4ba5f089y38c76d4f48131258@mail.gmail.com>

Is anyone working on a VCF (Variant Call Format) parser in bioruby?
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

From jan.aerts at gmail.com  Thu Nov 26 13:46:52 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:46:52 +0000
Subject: [BioRuby] Announcing BigBio project for Ruby
In-Reply-To: <20091126134427.GA20660@thebird.nl>
References: <20091126134427.GA20660@thebird.nl>
Message-ID: <4c7507a70911260546w45839e7fra4a2565a66bc47ff@mail.gmail.com>

Interesting... Planning to incorporate SAM/BAM alignment formats for
nextgen sequences?

jan.

2009/11/26 Pjotr Prins :
> BigBio = BIG DATA computing (for Ruby)
>
> BigBio is an initiative to create high performance libraries for big data
> computing in biology - initially for the Ruby language.
>
> The Ruby version of BigBio uses BioRuby, when sensible, but provides an
> interface with a different design. Also, unlike BioRuby which aims to be pure
> Ruby, it uses BioLib C/C++ functions for increased performance and reduced
> memory consumption.
>
> The first module is an (indexed) FastaReader which does not load the
> full FASTA file in memory.
>
> http://github.com/pjotrp/bigbio
>
> Pj.

From jan.aerts at gmail.com  Thu Nov 26 13:52:16 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:52:16 +0000
Subject: [BioRuby] Bio::DB::Sam
Message-ID: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>

And another parser that probably should be added to bioruby: something
to interact with SAM/BAM files (which contain mapping positions for
short reads). More info at samtools.sourceforge.net

Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
go for something similar?
http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Is anyone already working on this?

jan.
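
To give a rough idea of what such a parser would have to handle: a SAM
alignment line is tab-separated text with eleven mandatory fields,
optionally followed by tags, and header lines start with "@". The
sketch below reads one such line in pure Ruby. It is only an
illustration, not an existing BioRuby or Biolib API: SamRecord and
parse_sam_line are hypothetical names, the field names follow the
column order of the SAM specification, and BAM (the binary form) is not
covered.

```ruby
# Hypothetical record type for one SAM alignment line (not a BioRuby class).
SamRecord = Struct.new(:qname, :flag, :rname, :pos, :mapq, :cigar,
                       :rnext, :pnext, :tlen, :seq, :qual, :tags) do
  # Bit 0x4 of FLAG marks a read that is unmapped.
  def unmapped?
    (flag & 0x4) != 0
  end
end

# Parse one line of SAM text. Returns nil for header lines ("@...").
def parse_sam_line(line)
  return nil if line.start_with?('@')
  f = line.chomp.split("\t")
  raise ArgumentError, "expected at least 11 fields, got #{f.size}" if f.size < 11
  SamRecord.new(f[0], f[1].to_i, f[2], f[3].to_i, f[4].to_i, f[5],
                f[6], f[7].to_i, f[8].to_i, f[9], f[10], f[11..-1])
end

# Made-up example line, for illustration only.
line = "read1\t0\tchrI\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
rec  = parse_sam_line(line)
puts "#{rec.qname} maps to #{rec.rname} at #{rec.pos} (CIGAR #{rec.cigar})"
```

A real interface would presumably wrap the samtools C library instead,
since BAM files and indexed region queries need more than line
splitting.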
From pjotr.public14 at thebird.nl  Thu Nov 26 14:17:03 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 15:17:03 +0100
Subject: [BioRuby] Bio::DB::Sam
In-Reply-To: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
References: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
Message-ID: <20091126141703.GA21032@thebird.nl>

On Thu, Nov 26, 2009 at 01:52:16PM +0000, Jan Aerts wrote:
> And another parser that probably should be added to bioruby: something
> to interact with SAM/BAM files (which contain mapping positions for
> short reads). More info at samtools.sourceforge.net

By the looks of it, it should be relatively easy with SWIG - and
therefore Biolib.

> Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
> go for something similar?
> http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Wow, this guy is hard core! Doing this with PerlXS takes a *lot* of
effort. XS is sooooo nineties ;-)

> Is anyone already working on this?

I am happy to write a SWIG mapper, if someone really cares to use it
and will write the higher-level Ruby interface (nice OOP class
representation). I have been told Bioruby is pure Ruby - so this will
not fit in.

Pj.

From biopython at maubp.freeserve.co.uk  Thu Nov 26 16:02:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Nov 2009 16:02:50 +0000
Subject: [BioRuby] Fwd: [DAS] DAS workshop 7th-9th April 2010
In-Reply-To:
References:
Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com>

This might be of interest to some of you.
Peter

---------- Forwarded message ----------
From: Jonathan Warren
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: das at biodas.org, das_registry_announce at sanger.ac.uk,
biojava-dev , BioJava , BioPerl , all at sanger.ac.uk,
all at ebi.ac.uk, ensembldev

We are considering running a Distributed Annotation System workshop
here at the Sanger/EBI in the UK, subject to decent demand. The
workshop will be held from Wednesday 7th-Friday 9th April 2010.

If you would be interested in attending, either to present or just to
take part, then please email me: jw12 at sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st
day for beginners, 2nd for both beginners and advanced users, 3rd day
for advanced), information for which can be found here:
http://www.dasregistry.org/course.jsp

If you would like to present, then please send a short summary of what
you would like to talk about.

Thanks

Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
DAS mailing list
DAS at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das

From josejotero at gmail.com  Sat Nov 28 02:55:38 2009
From: josejotero at gmail.com (Jose Otero)
Date: Fri, 27 Nov 2009 18:55:38 -0800
Subject: [BioRuby] Bio::GenBank
Message-ID:

Hello all,

I'm new to BioRuby and I am trying to adapt the Bio::GenBank class to
store information from my plasmid database.

Question 1: Does anybody know how to insert a nucleic acid sequence as
the value of 'sequence' in the @data object?
Placing information in as Bio::Feature::Qualifier objects is easy, as
is inserting Bio::Locus information. But I can't figure out how to
insert the sequence data.
Question 2: Has anybody ever changed the data in a Bio::GenBank object
and saved the altered file? This would be very interesting for my
plasmid database.

Thanks for the help.

JO

From ngoto at gen-info.osaka-u.ac.jp  Sat Nov 28 09:00:01 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 28 Nov 2009 18:00:01 +0900
Subject: [BioRuby] Bio::GenBank
In-Reply-To:
References:
Message-ID: <20091128090002.372041CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hello Jose,

On Fri, 27 Nov 2009 18:55:38 -0800
Jose Otero wrote:

> Hello all,
> I'm new to BioRuby and I am trying to adapt the Bio::GenBank class to store
> information of my plasmid database.
> Question 1: Does anybody know how to insert a nucleic acid sequence as the
> value to 'sequence' in the @data object?
> Placing in information as Bio::Feature::Qualifier objects is easy, as is
> inserting Bio::Locus information. But I can't figure how to insert the
> sequence data.

Once an object of the Bio::GenBank class is created, the data stored
in the object are intended to be read-only, though modification is not
explicitly prohibited. This is because the class is designed for
efficient parsing of GenBank formatted text, and it is technically not
easy to achieve both efficient parsing and flexible modification.
(This also applies to most parser classes, e.g. Bio::EMBL, Bio::SPTR,
etc.)

In your case, using Bio::Sequence seems the best way. After conversion
from a Bio::GenBank object to a Bio::Sequence object, it can be freely
modified.

  require 'bio'

  # Assume str contains GenBank formatted text as String.
  #
  # Creating a new Bio::GenBank object.
  gb = Bio::GenBank.new(str)
  # Converting to Bio::Sequence object
  s = gb.to_biosequence
  # Modifying the sequence.
  #
  # Note that other attributes, such as features and references
  # (which depend on locations on the sequence) are kept unchanged.
  # Relocation of the features, references, etc. is left to the
  # user.
  #
  s.seq = 'atgc' * 10 + s.seq
  # Text formatting as the GenBank format.
  puts s.output(:genbank)

Creating a new Bio::Sequence object from scratch, giving definition,
accessions, keywords, references, features, etc., and getting
GenBank-formatted text can also be done.

> Question 2: Has anybody ever changed the data from a Bio::GenBank object and
> save the altered file? This would be very interesting for my plasmid
> database.

As described above, Bio::Sequence#output can be used. The method
returns formatted text as String, and you can easily write it to
a file.

> Thanks for the help.
> JO

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
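
The note in the code above, that relocating features after the
sequence is modified is left to the user, can be sketched in plain
Ruby. When a prefix such as 'atgc' * 10 is prepended, every location
downstream moves right by the prefix length (40 bases here). The
Feature struct, the shift_features helper and the example coordinates
below are all hypothetical stand-ins for illustration; they are not
BioRuby classes, and real GenBank locations (joins, complements, fuzzy
ends) would need more care.

```ruby
# Hypothetical stand-in for an annotated feature; NOT a BioRuby class.
Feature = Struct.new(:name, :from, :to)

# Return new features whose coordinates are shifted right by the
# length of the sequence that was prepended.
def shift_features(features, prepended)
  offset = prepended.length
  features.map { |f| Feature.new(f.name, f.from + offset, f.to + offset) }
end

prefix   = 'atgc' * 10   # 40 bases, as in the example above
features = [Feature.new('ori', 1, 120), Feature.new('bla', 200, 1060)]  # made-up coordinates
shifted  = shift_features(features, prefix)

# 'ori' moves from 1..120 to 41..160 after the 40-base prefix is added.
shifted.each { |f| puts "#{f.name}\t#{f.from}..#{f.to}" }
```

Appending sequence at the end, by contrast, leaves the existing
coordinates untouched, so only insertions before a feature need this
kind of adjustment.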