From yannick.wurm at unil.ch  Tue Nov  3 09:11:52 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 15:11:52 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
Message-ID: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>

Hi,

this is a more general ruby question, but since my application is  
bioinformatics, I'm posting it here.

Just wanted to prepend a few characters in front of FASTA identifiers.


$ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
	real	0m20.379s
	user	0m0.741s
	sys	0m0.168s


While the perl equivalent is one heck of a lot faster!!!


$ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
	real	0m2.165s
	user	0m0.266s
	sys	0m0.146s


Is there any hope for ruby?

Thanks,
yannick


--------------------------------------------
           yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
    http://www.unil.ch/dee/page28685_fr.html


From yannick.wurm at unil.ch  Tue Nov  3 17:49:12 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 23:49:12 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
Message-ID: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>

Hi Mike,

thanks for your response. I'm running:
ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
Starting to age, but on a production machine I'd rather stay with what  
works than risk breaking things by upgrading them.

the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

My reasons for preferring ruby are the same as yours. But a 5 to 10x  
speed difference is expensive  (I'm calling the one-liner below about  
10,000 times from a larger ruby script - YES, it's ugly, but  
refactoring the script to avoid calling that type of oneliner would be  
a pain since I use 10,000 different prefixes).
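
If the cost really is per-process startup, one way around it is to do the substitution inside the calling script instead of spawning a new interpreter for each prefix. The following is only a minimal sketch under that assumption; `prefix_fasta_ids` is a hypothetical helper name, not something from the original script or from BioRuby.

```ruby
require 'stringio'

# Hypothetical helper: copy FASTA data from `input` to `output`, inserting
# `prefix` right after the ">" of every header line. Works with any objects
# that respond to each_line and <<, e.g. File or StringIO.
def prefix_fasta_ids(input, output, prefix)
  input.each_line do |line|
    if line.start_with?('>')
      output << '>' << prefix << line[1..-1]
    else
      output << line
    end
  end
  output
end

# Small in-memory demonstration
fasta  = StringIO.new(">seq1\nACGT\n>seq2\nGGCC\n")
result = prefix_fasta_ids(fasta, StringIO.new, 'MyPrefix').string
puts result
```

With real files, the same helper could be called once per prefix from the larger script, passing two File objects opened with File.open, so the interpreter is started only once.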

I have the feeling that it's mostly ruby's startup time. Running  
the ruby one-liner on a FASTA of 40,000 sequences takes 20 seconds;  
running it on a FASTA of only 10 lines still takes 13 seconds!!

I found some generic benchmarks indicating that ruby is generally only  
a bit slower than perl
http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

So maybe I can keep using ruby - just avoiding one-liners!

Best,
yannick

On 3 Nov 2009, at 22:26, Michael Barton wrote:

> What version of Ruby are you using?
> Ruby is an expressive language rather than a "fast" language.
> I use Ruby because it's easier to read and maintain my programs, rather
> than because of how fast it is.
>
> If you are interested purely in speed you could write in C?
> What are the benchmarks for something like this?
>
> time sed 's/^>/>MyPrefix/' clustering/dirsForAssembly/singlets.fasta > abc
>
> Mike
>
> 2009/11/3 Yannick Wurm <yannick.wurm at unil.ch>:
>> Hi,
>>
>> this is a more general ruby question, but since my application is
>> bioinformatics, I'm posting it here.
>>
>> Just wanted to prepend a few characters in front of FASTA  
>> identifiers.
>>
>>
>> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>>        real    0m20.379s
>>        user    0m0.741s
>>        sys     0m0.168s
>>
>>
>> While the perl equivalent is one heck of a lot faster!!!
>>
>>
>> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e
>> 's/^>/>MyPrefix/g' > ab
>>        real    0m2.165s
>>        user    0m0.266s
>>        sys     0m0.146s
>>
>>
>> Is there any hope for ruby?
>>
>> Thanks,
>> yannick
>>
>>
>> --------------------------------------------
>>          yannick . wurm @ unil . ch
>> Ant Genomics, Ecology & Evolution @ Lausanne
>>   http://www.unil.ch/dee/page28685_fr.html
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>


From juanfc at uma.es  Tue Nov  3 17:44:10 2009
From: juanfc at uma.es (Juan Falgueras)
Date: Tue, 3 Nov 2009 23:44:10 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <87CAA48B-151F-41C3-9DF5-23C4B43BDFD0@uma.es>


	Hi, have you tried it with Ruby 1.9?


On 03/11/2009, at 15:11, Yannick Wurm wrote:

> Hi,
>
> this is a more general ruby question, but since my application is  
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
> 	real	0m20.379s
> 	user	0m0.741s
> 	sys	0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e  
> 's/^>/>MyPrefix/g' > ab
> 	real	0m2.165s
> 	user	0m0.266s
> 	sys	0m0.146s
>
>
> Is there any hope for ruby?
>
> Thanks,
> yannick
>
>
> --------------------------------------------
>          yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
>   http://www.unil.ch/dee/page28685_fr.html
>
>


From trevor at corevx.com  Tue Nov  3 18:18:50 2009
From: trevor at corevx.com (Trevor Wennblom)
Date: Tue, 3 Nov 2009 17:18:50 -0600
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
	<626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID: <CE266AD0-6CAD-40CE-93AD-53B86C35F8D8@corevx.com>


On Nov 3, 2009, at 4:49 PM, Yannick Wurm wrote:

> I found some generic benchmarks indicating that ruby is generally  
> only a bit slower than perl
> http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=yarv&lang2=perl&box=1

From robert.citek at gmail.com  Tue Nov  3 20:32:12 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Tue, 3 Nov 2009 20:32:12 -0500
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <4145b6790911031732m731d0b09o199041ab0feb610c@mail.gmail.com>

On Tue, Nov 3, 2009 at 9:11 AM, Yannick Wurm <yannick.wurm at unil.ch> wrote:
> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
> $time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/,
> '>MyPrefix')" > abc
>        real    0m20.379s
>        user    0m0.741s
>        sys     0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e
> 's/^>/>MyPrefix/g' > ab
>        real    0m2.165s
>        user    0m0.266s
>        sys     0m0.146s
>
>
> Is there any hope for ruby?

I get a factor of about three on a 10,000,000 line FASTA file:

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | ruby -pe "gsub(/^>/,
'>MyPrefix')" > /dev/null
real 42.99
user 43.39
sys 0.63

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | perl -pe
's/^>/>MyPrefix/g' > /dev/null
real 15.89
user 16.33
sys 0.26

This is with perl 5.8.8 and ruby 1.8.6 on a dual 1.6 GHz CPU with 512 MB RAM.

Notice your user and system times are less than a factor of three.
It's only the real time that is 10x, which suggests that ruby is
waiting on other processes, e.g. disk reads.
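
The same real-versus-CPU comparison can be made from inside Ruby with the standard Benchmark library. This is just an illustrative sketch: `wall_vs_cpu` is a made-up helper name, and `sleep` stands in for time spent waiting on something like a slow disk or network filesystem.

```ruby
require 'benchmark'

# Hypothetical helper: run a block and report wall-clock ("real") time
# next to the CPU time (user + system, including child processes).
def wall_vs_cpu
  tms = Benchmark.measure { yield }
  cpu = tms.utime + tms.stime + tms.cutime + tms.cstime
  { :real => tms.real, :cpu => cpu }
end

# sleep uses almost no CPU, so real time ends up far above CPU time,
# which is the same pattern as in the `time` output quoted above.
timing = wall_vs_cpu { sleep 0.2 }
printf("real %.2fs, cpu %.2fs\n", timing[:real], timing[:cpu])
```

A large gap between the two numbers points at waiting (I/O, network, other processes) rather than at the interpreter doing slow computation.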

Regards,
- Robert


From pjotr.public14 at thebird.nl  Wed Nov  4 05:22:45 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 4 Nov 2009 11:22:45 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <20091104102245.GA13264@thebird.nl>

On Tue, Nov 03, 2009 at 03:11:52PM +0100, Yannick Wurm wrote:
> Is there any hope for ruby?

I guess you mean this tongue in cheek. However, it is dangerous as it
may turn off users looking to start with Ruby or Perl. So let me state
I think there is plenty of hope for Ruby. You are talking execution
speed of 'simple' oneliners. For complex programming Ruby outspeeds
Perl, usually in practise. Particularly the speed of getting things
done, but also a cleaner way of programming helps create better code.
The end result will often be faster. And the third gain is in the code
maintenance cycle. I am talking from experience here. I have written
a lot of code in both languages (and Python too).

Perl6 is getting interesting. The syntax is much cleaned up, proper
OOP, and (what I like) strong functional programming support. But its
execution speed is not even close to Ruby's now. I have heard people
joke that Ruby is what Perl6 was meant to be.

Anyway you can see where the Perl folks are heading.

Pj.

P.S. What is there to stop you from using both languages?

From mail at michaelbarton.me.uk  Wed Nov  4 06:24:36 2009
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Wed, 4 Nov 2009 11:24:36 +0000
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org> 
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com> 
	<626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID: <c27b73c0911040324s66de2a97jde807bcb9fea78b@mail.gmail.com>

2009/11/3 Yannick Wurm <yannick.wurm at unil.ch>:
> thanks for your response. I'm running:
> ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
> Starting to age, but on a production machine I'd rather stay with what works
> than risk breaking things by upgrading them.

I think Ruby 1.9 is now the official Ruby release, so you might want
to start trying out using this version, for example Rails 3.0 won't
work with Ruby 1.8.6 anymore. I've tried Ruby 1.9 a bit myself and the
requirements for compatibility are relatively small. If you still
prefer to use 1.8, you could try using REE
(http://www.rubyenterpriseedition.com/) which has a few patches to
improve performance over vanilla 1.8. You could try using
ruby_switcher which makes trying different ruby versions a bit less
painful - http://bit.ly/1kY1Qk

> the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

Could you just try calling out to sed then?

> I have the feeling that it's mostly ruby's startup time. Running the
> ruby one-liner on a FASTA of 40,000 sequences takes 20 seconds; running it
> on a FASTA of only 10 lines still takes 13 seconds!!

You might also want to try experimenting with gsub! instead of gsub as
the former does destructive in place substitution while the latter
creates an extra object with the substituted text. This extra object
creation might also slow performance.
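
To make the difference concrete, here is a small sketch of the in-place variant. `add_prefix!` is a hypothetical name used only for illustration; note that `sub!` returns nil when nothing was substituted, so the helper returns the line itself rather than the result of `sub!`.

```ruby
# Hypothetical helper: modify `line` in place, adding `prefix` after a
# leading ">" and leaving sequence lines untouched. The block form avoids
# any backslash interpretation in the replacement text.
def add_prefix!(line, prefix)
  line.sub!(/\A>/) { ">#{prefix}" }
  line
end

header   = ">seq1\n".dup
sequence = "ACGT\n".dup

add_prefix!(header, 'MyPrefix')    # header itself is changed
add_prefix!(sequence, 'MyPrefix')  # no match, sequence stays as it was

puts header
puts sequence
```

Since each line has only one leading ">", `sub!` is enough here; `gsub!` would behave the same on this input. Whether this makes a measurable difference for the one-liner is something to benchmark rather than assume.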

Cheers Mike

From diapriid at gmail.com  Wed Nov  4 13:29:13 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 13:29:13 -0500
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
Message-ID: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>

Hi all,

As far as I can tell there is yet no straightforward way to use
Bio:Blast with the NCBI portal? I've seen this on the wiki: "Add
remote BLAST search sites", and understand the basic concept, but
don't have time at present to work on this.  Is anyone actively
working on this? (just FYI see
http://github.com/kwicher/ruby-blast-at-ncbi).

I ask in part because I'm struggling to get a basic remote blast working:

seq = Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGTTTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA')

blast_factory = Bio::Blast.new('blastn','nr-nt', '', 'genomenet')
foo = blast_factory.query(seq)

... freezes, when I ctrl-C

from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in
`call'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in
`sleep'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in
`exec_genomenet'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in
`__send__'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in
`query'
from (irb):25

any glaring problems with this? Is it just waiting for the results of
the remote query?   I noticed that the genomenet blasts are much
slower than NCBI in general (I'm in the US).

thanks,
Matt


From diapriid at gmail.com  Wed Nov  4 14:57:11 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 14:57:11 -0500
Subject: [BioRuby] (previous answered in part) timeout/long time
Message-ID: <19d6b9770911041157l1556ac89s4e8c62ad2e20460d@mail.gmail.com>

Aha- my queries *are* working, just taking a very long time to finish.

Can I limit to say top 10 results?

cheers,
Matt

From yannick.wurm at unil.ch  Wed Nov  4 14:56:13 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Wed, 4 Nov 2009 20:56:13 +0100
Subject: [BioRuby] Ruby speed
Message-ID: <81E8B742-2508-40DF-8E81-07F1C8126839@unil.ch>

> Notice your user and system times are less than a factor of three.
> It's only the real time that is 10x, which suggests that ruby is
> waiting on other processes, e.g. disk reads.

Great point Robert - I hadn't seen that. My guess is that the difference  
is due to the fact that ruby is only installed in my networked (sfs) home  
dir on the linux server, not on the local machine like perl is. Gotta  
get the sysadmins to install ruby :)

cheers!
yannick


From email2ants at gmail.com  Thu Nov  5 11:22:12 2009
From: email2ants at gmail.com (Anthony Underwood)
Date: Thu, 5 Nov 2009 16:22:12 +0000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <86C24368-84E1-4A43-ABBD-A26B998159B2@gmail.com>

Hi Matt

I have done a bit of work to get NCBI blast working within bioruby.


See this gist on github http://gist.github.com/227160

ncbi_blast.rb defines an exec_ncbi class for the Blast class in bioruby
The script ncbi_blast_test.rb illustrates its usage but uses a few  
functions defined in the blast_functions.rb file


essentially the following should work

require 'rubygems'
require 'bio'
require 'ncbi_blast'
# If you are working from behind a proxy, set this and enter the IP and
# port number as appropriate
ENV['http_proxy'] = "http://proxy_server_ip:port_number"

sequence = "ATGAATCCAAATCAGAAAATAATAA........"

factory = Bio::Blast.remote('blastn', 'nr', '', 'ncbi')
blast_report = factory.query(sequence)


blast_report will be a Bio::Blast::Report object which can be parsed  
as described in the bioruby api

The hit definitions are fairly uninformative, containing just the  
accessions. This is why I then have to fetch the data from EMBL as  
follows

     accession = definition.split("|")[3]
     accession.sub!(/\..+$/, "") # remove version number
     server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
     embl_text = server.fetch('embl', accession)
     embl_object = Bio::EMBL.new(embl_text)
     puts embl_object.description
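
The accession-extraction step on its own can be tried without any network access. The sketch below assumes the hit definition is a pipe-separated string such as "gi|12345|emb|AJ000001.1|" (an invented example, not a real record); `accession_from_definition` is a hypothetical helper name.

```ruby
# Hypothetical helper: pull the fourth pipe-separated field out of a hit
# definition and strip the trailing ".version" part. Returns nil when the
# definition does not have that many fields.
def accession_from_definition(definition)
  field = definition.split('|')[3]
  return nil if field.nil? || field.empty?
  field.sub(/\..+\z/, '')
end

puts accession_from_definition('gi|12345|emb|AJ000001.1|')
```

Returning nil for unexpected definitions lets the caller skip the dbfetch request instead of sending an empty accession.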


This is still a work in progress but it worked OK for me. Hope it is  
of some use to you.


Anthony


On 4 Nov 2009, at 18:29, Matt wrote:

> Hi all,
>
> As far as I can tell there is yet no straightforward way to use
> Bio:Blast with the NCBI portal? I've seen this on the wiki: "Add
> remote BLAST search sites", and understand the basic concept, but
> don't have time at present to work on this.  Is anyone actively
> working on this? (just FYI see
> http://github.com/kwicher/ruby-blast-at-ncbi).
>
> I ask in part because I'm struggling to get a basic remote blast  
> working:
>
> seq = Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGTTTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA')
>
> blast_factory = Bio::Blast.new('blastn','nr-nt', '', 'genomenet')
> foo = blast_factory.query(seq)
>
> ... freezes, when I ctrl-C
>
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ 
> genomenet.rb:224:in
> `call'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ 
> genomenet.rb:224:in
> `sleep'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/ 
> genomenet.rb:224:in
> `exec_genomenet'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/ 
> blast.rb:368:in
> `__send__'
> from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/ 
> blast.rb:368:in
> `query'
> from (irb):25
>
> any glaring problems with this? Is it just waiting for the results of
> the remote query?   I noticed that the genomenet blasts are much
> slower than NCBI in general (I'm in the US).
>
> thanks,
> Matt
>


From kenglish at gmail.com  Thu Nov  5 11:43:31 2009
From: kenglish at gmail.com (Kevin English)
Date: Thu, 5 Nov 2009 06:43:31 -1000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <d82e7f560911050843v508f36c5x504cc8924b62578f@mail.gmail.com>

Have you considered downloading the nr-nt databases and running local queries?

I played with the Blast Remote for a while but determined it was too
slow for our workload...

Kevin

From yannick.wurm at unil.ch  Thu Nov  5 15:06:33 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 5 Nov 2009 21:06:33 +0100
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 1
In-Reply-To: <mailman.19.1257354004.14930.bioruby@lists.open-bio.org>
References: <mailman.19.1257354004.14930.bioruby@lists.open-bio.org>
Message-ID: <D4414522-8174-4599-9F15-350E5FC63D99@unil.ch>

On 4 Nov 2009, at 18:00, bioruby-request at lists.open-bio.org wrote:

> I guess you mean this tongue in cheek. However, it is dangerous as it
> may turn off users looking to start with Ruby or Perl. So let me state
> I think there is plenty of hope for Ruby. You are talking execution
> speed of 'simple' oneliners. For complex programming Ruby outspeeds
> Perl, usually in practise. Particularly the speed of getting things
> done, but also a cleaner way of programming helps create better code.
> The end result will often be faster. And the third gain is in the code
> maintenance cycle. I am talking from experience here. I have written
> a lot of code in both languages (and Python too).

Those are excellent points, Pjotr.


> Perl6 is getting interesting. The syntax is much cleaned up, proper
> OOP, and (what I like) strong functional programming support. But its
> execution speed is not even close to Ruby's now. I have heard people
> joke that Ruby is what Perl6 was meant to be.
>
> Anyway you can see where the Perl folks are heading.

Yes, Damian Conway of Perl Best Practices gave us a small workshop  
recently, and I couldn't help but think that Perl6 was an attempt to  
rubify perl :)

> P.S. What is there to stop you from using both languages?

Nothing official. But I already find it difficult to keep the R, bash  
and ruby parts of my brain optimized without mixing in perl and  
others :)

Cheers,
yannick


From rob.syme at gmail.com  Thu Nov  5 21:55:45 2009
From: rob.syme at gmail.com (Rob Syme)
Date: Fri, 6 Nov 2009 10:55:45 +0800
Subject: [BioRuby] Parsing large blastout.xml files
Message-ID: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>

I'm trying to extract information from a large blast xml file. To parse the
xml file, ruby reads the whole file into memory before looking at each
entry. For large files (2.5GBish) - the memory requirements become severe.

My first approach was to split each query up into its own <BlastOutput> xml
instance, so that

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

Would end up looking more like:
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
their own file:

$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'

Now each file can be parsed individually. I feel like there has to be an
easier way. Is there a way to parse large xml files without huge memory
overheads, or is that just par for the course?
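
One low-tech alternative to splitting into files is to read the report line by line and hand over one <Iteration> block at a time, so that only a single query's hits are held in memory. The sketch below is an assumption-laden illustration, not BioRuby functionality: `each_iteration_chunk` is a made-up name, and it assumes the <Iteration> and </Iteration> tags each sit on their own line, as in the layout shown above.

```ruby
require 'stringio'

# Hypothetical helper: yield the text of each <Iteration>...</Iteration>
# element in turn. Only one iteration is buffered at any time.
def each_iteration_chunk(io)
  buffer = nil
  io.each_line do |line|
    buffer = ''.dup if line.include?('<Iteration>')
    buffer << line if buffer
    if buffer && line.include?('</Iteration>')
      yield buffer
      buffer = nil
    end
  end
end

xml = <<XML
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
XML

each_iteration_chunk(StringIO.new(xml)) do |chunk|
  puts "iteration with #{chunk.scan('<Hit>').size} hit(s)"
end
```

Each chunk is small enough to be handed to an ordinary XML parser; with a real report, a File object would take the place of the StringIO.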

From rozziite at gmail.com  Thu Nov  5 22:11:32 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Thu, 5 Nov 2009 22:11:32 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>

Another option is to use ruby-libxml reader.
http://libxml.rubyforge.org/rdoc/index.html  It reads the data
sequentially thus there is no memory overhead of first reading it all
in memory. However, then you would have to parse it from scratch.

On that note, maybe it is worth implementing Bio::Blast::Report.libxml
or something like that, the same way there is Bio::Blast::Report.rexml
and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was
introduced into the BioRuby library by the PhyloXML parser.

Diana

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
>
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
>
> Would end up looking more like:
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
>
> Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?
>


From adamnkraut at gmail.com  Thu Nov  5 22:17:02 2009
From: adamnkraut at gmail.com (Adam)
Date: Thu, 5 Nov 2009 22:17:02 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <134ede0b0911051917nf0877e0y8df95c3147a24d07@mail.gmail.com>

You might want to try a SAX Parser instead.

REXML from the standard library has a streaming API.  LibXML is a lot faster
and it's available as a gem.

http://libxml.rubyforge.org/
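
As a rough illustration of the streaming approach with REXML, the sketch below counts hits per iteration without building a document tree. `HitCounter` and `count_hits` are hypothetical names chosen for this example; the REXML calls used are `REXML::StreamListener` and `REXML::Document.parse_stream`. On recent Ruby versions REXML is shipped as a gem, so it may need to be installed.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: REXML calls tag_start for every opening tag as it
# reads the input, so no tree of the whole report is kept in memory.
class HitCounter
  include REXML::StreamListener
  attr_reader :hits_per_iteration

  def initialize
    @hits_per_iteration = []
  end

  def tag_start(name, _attrs)
    case name
    when 'Iteration'
      @hits_per_iteration << 0
    when 'Hit'
      @hits_per_iteration[-1] += 1 unless @hits_per_iteration.empty?
    end
  end
end

# Hypothetical helper: accepts a String or an IO (e.g. an open File).
def count_hits(source)
  listener = HitCounter.new
  REXML::Document.parse_stream(source, listener)
  listener.hits_per_iteration
end

xml = '<BlastOutput><BlastOutput_iterations>' \
      '<Iteration><Iteration_hits><Hit></Hit><Hit></Hit></Iteration_hits></Iteration>' \
      '<Iteration><Iteration_hits><Hit></Hit></Iteration_hits></Iteration>' \
      '</BlastOutput_iterations></BlastOutput>'
p count_hits(xml)
```

A real parser would also implement `text` and `tag_end` to collect the fields of each hit; the libxml reader mentioned above follows the same event-driven idea.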

On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:

> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> Would end up looking more like:
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
>  <BlastOutput_iterations>
>    <Iteration>
>      <Iteration_hits>
>        <Hit></Hit>
>        <Hit></Hit>
>        <Hit></Hit>
>      </Iteration_hits>
>    </Iteration>
>  </BlastOutput_iterations>
> </BlastOutput>
>
> Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
> their own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?
>

From pjotr.public14 at thebird.nl  Fri Nov  6 03:58:15 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 6 Nov 2009 09:58:15 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
	<4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
Message-ID: <20091106085815.GA12244@thebird.nl>

Diana is right. We need to revamp the implementation for big results.
Not only that, the current implementation has method names that do not
match the BLAST names. I need something like this pretty soon and
was thinking of writing it.

Pj.

On Thu, Nov 05, 2009 at 10:11:32PM -0500, Diana Jaunzeikare wrote:
> Another option is to use ruby-libxml reader.
> http://libxml.rubyforge.org/rdoc/index.html  It reads the data
> sequentially thus there is no memory overhead of first reading it all
> in memory. However, then you would have to parse it from scratch.
> 
> On that note, maybe it is worth implementing Bio::Blast::Report.libxml
> or something like that the same way there is Bio::Blast::Report.rexml
> and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was
> introduced into the BioRuby library with the PhyloXML parser.
> 
> Diana
> 
> On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:
> > I'm trying to extract information from a large blast xml file. To parse the
> > xml file, ruby reads the whole file into memory before looking at each
> > entry. For large files (2.5GBish) - the memory requirements become severe.
> >
> > My first approach was to split each query up into its own <BlastOutput> xml
> > instance, so that
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > Would end up looking more like:
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > <BlastOutput>
> >  <BlastOutput_iterations>
> >    <Iteration>
> >      <Iteration_hits>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >        <Hit></Hit>
> >      </Iteration_hits>
> >    </Iteration>
> >  </BlastOutput_iterations>
> > </BlastOutput>
> >
> > Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
> > their own file:
> >
> > $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
> >
> > Now each file can be parsed individually. I feel like there has to be an
> > easier way. Is there a way to parse large xml files without huge memory
> > overheads, or is that just par for the course?

From pjotr.public14 at thebird.nl  Sat Nov  7 02:42:44 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 7 Nov 2009 08:42:44 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <20091107074244.GA22748@thebird.nl>

I did the same a while back using xmltwig:

  http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml
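
A rough Ruby sketch of the same splitting idea (this is not the linked
xmltwig script). It assumes that the <Iteration> and </Iteration> tags
each sit on a line of their own, as in typical BLAST XML output;
each_iteration is a hypothetical helper name. Only one iteration is held
in memory at a time:

```ruby
require 'stringio'

# Yields every <Iteration>...</Iteration> block as a string, one at a time.
# Note that '<Iteration_hits>' does not match '<Iteration>' because of the
# closing '>' in the search string.
def each_iteration(io)
  return enum_for(:each_iteration, io) unless block_given?

  buffer = nil
  io.each_line do |line|
    buffer = +'' if line.include?('<Iteration>')
    buffer << line if buffer
    if buffer && line.include?('</Iteration>')
      yield buffer
      buffer = nil
    end
  end
end

xml = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_hits>
          <Hit></Hit>
          <Hit></Hit>
        </Iteration_hits>
      </Iteration>
      <Iteration>
        <Iteration_hits>
          <Hit></Hit>
        </Iteration_hits>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

each_iteration(StringIO.new(xml)) do |chunk|
  puts chunk.scan('<Hit>').size   # prints 2, then 1
end
```

Each chunk could then be wrapped in a <BlastOutput> header and written to
its own file, or handed to a parser directly.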

On Fri, Nov 06, 2009 at 10:55:45AM +0800, Rob Syme wrote:
> I'm trying to extract information from a large blast xml file. To parse the
> xml file, ruby reads the whole file into memory before looking at each
> entry. For large files (2.5GBish) - the memory requirements become severe.
> 
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> Would end up looking more like:
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
> their own file:
> 
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
> 
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?

From djaunzei at smith.edu  Sat Nov  7 22:50:26 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Sat, 7 Nov 2009 22:50:26 -0500
Subject: [BioRuby] BioRuby Phyloxml update
Message-ID: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>

Hi all,

So finally I have updated Bio::Tree and Bio::Node classes to improve
the phyloxml writer speed.

* Added Bio::Node::parent and Bio::Node::children (array of nodes) in
order to avoid calling Tree::parent(node) or Tree::children(node),
because those methods run a breadth-first search on the underlying
graph, which makes the PhyloXML writer and parser incredibly slow. In
contrast, Bio::Node::parent and Bio::Node::children keep references
to the respective nodes.
* Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
track of Node::parent and Node::children nodes correctly.  Have I
forgotten anything?
* The PhyloXML writer now takes less than 1 second instead of
~20 minutes to write the ncbi_taxonomy_mollusca.xml file (1.5 MB).
* Writing the tree of life taxonomy file (~46 MB) takes 10 seconds
(on a 2.4 GHz machine with 2.9 GB RAM, running Ubuntu).
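
To illustrate the difference between the two lookup strategies, here is
a simplified sketch. GraphTree and LinkedNode are hypothetical classes
written for this example, not the actual Bio::Tree code: the first finds
a parent by searching from the root on every call, the second just
follows a stored reference.

```ruby
# Parent lookup by breadth-first search over an undirected graph.
class GraphTree
  def initialize(root)
    @root = root
    @adjacency = Hash.new { |hash, key| hash[key] = [] }
  end

  def add_edge(a, b)
    @adjacency[a] << b
    @adjacency[b] << a
  end

  # Walks outward from the root until the node is found: O(n) per call.
  def parent(node)
    reached_from = { @root => nil }
    queue = [@root]
    until queue.empty?
      current = queue.shift
      return reached_from[current] if current == node

      @adjacency[current].each do |neighbour|
        next if reached_from.key?(neighbour)

        reached_from[neighbour] = current
        queue << neighbour
      end
    end
    nil
  end
end

# Parent lookup by stored reference: O(1), but every edit to the tree
# has to keep the references in sync.
class LinkedNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil
    @children = []
  end

  def add_child(child)
    child.parent = self
    @children << child
    child
  end
end

graph = GraphTree.new(:root)
graph.add_edge(:root, :a)
graph.add_edge(:a, :b)
p graph.parent(:b)        # prints :a

root = LinkedNode.new('root')
a = root.add_child(LinkedNode.new('a'))
b = a.add_child(LinkedNode.new('b'))
puts b.parent.name        # prints a
```

A writer that asks for the parent or children of every node pays the
search cost n times with the first approach, which is where the
quadratic behaviour comes from.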

The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class

I wrote unit tests for my changes and made sure my changes don't break
anything else. However, does anybody have code lying around that uses
the Tree::parent and Tree::children methods so that I can test it more
thoroughly?

Cheers,
Diana

From ngoto at gen-info.osaka-u.ac.jp  Sun Nov  8 07:50:56 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 08 Nov 2009 21:50:56 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
Message-ID: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Diana,

I'm sorry that the changes cannot be accepted, because the
modification of existing Bio::Tree methods breaks things.
Bio::Tree is designed not to keep children/parent information
in nodes. One of the reasons is that it is difficult to keep
consistency when copying a tree. Nodes can be shared between two
or more trees when a tree is copied with the "dup" or "clone"
method.
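
The consistency problem can be shown with a small sketch. SharedNode
and TinyTree are hypothetical classes for illustration only (they are
not Bio::Tree); the copy gets its own node list, but the node objects
themselves are shared, as with a shallow "dup":

```ruby
class SharedNode
  attr_reader :name
  attr_accessor :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end
end

class TinyTree
  attr_reader :nodes

  def initialize(nodes)
    @nodes = nodes
  end

  # dup/clone copy the array of nodes, but not the nodes themselves.
  def initialize_copy(source)
    super
    @nodes = source.nodes.dup
  end
end

root = SharedNode.new('root')
leaf = SharedNode.new('leaf', root)
original = TinyTree.new([root, leaf])
copy = original.dup

# Re-root only the copy: the old leaf becomes the parent of the old root.
copy.nodes[1].parent = nil
copy.nodes[0].parent = copy.nodes[1]

# The original tree was never touched, yet its nodes changed as well.
p original.nodes[1].parent          # prints nil
puts original.nodes[0].parent.name  # prints leaf
```

With parent/children stored in the nodes, every copy would need a deep
copy of all nodes to stay independent.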

Normally, tests for existing classes should not be modified
except when changing the specification or fixing a bug in the test,
because they guarantee the specification of the class. Adding new
tests is OK.

If you really want nodes to have parent/children information
in each node, please do so only in the PhyloXML classes (though
I'm not in favor).  In this case, reading phyloxml data and
writing it back again seems fine, but the problem is that there
is currently no way to convert Bio::Tree to PhyloXML. For now, it
seems hard to convert Newick data to PhyloXML.

Now, to prepare to include your PhyloXML code in BioRuby, I'm working
on my branch. Some API changes will be made.
http://github.com/ngoto/bioruby/tree/incoming

Note that in your test code, the argument order of assert_equal is wrong
(the expected value should come first, then the actual value).
I've already fixed it in my branch.
http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94

> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly.  Have I
> forgotten anything?

Changing root with tree.root=().

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


> Hi all,
> 
> So finally I have updated Bio::Tree and Bio::Node classes to improve
> the phyloxml writer speed.
> 
> * Added Bio::Node::parent and  Bio::Node::children (array of nodes) in
> order to avoid calling Tree::parent(node) or Tree::children(node),
> because those methods call breath first search on the underlying
> graph, which makes PhyloXML writer and parser incredibly slow. In
> contrast, Bio::Node::parent and Bio::Node::children keeps references
> to the respective nodes.
> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly.  Have I
> forgotten anything?
> * Now for PhyloXML writer it takes less than 1 second instead of
> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
> 
> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
> 
> I wrote unit tests for my changes and made sure my changes don't break
> anything else. However, does anybody has code laying around that uses
> Tree::parent and Tree::children methods so that I can test it more
> thoroughly?
> 
> Cheers,
> Diana
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


From jan.aerts at gmail.com  Mon Nov 16 05:11:24 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Mon, 16 Nov 2009 10:11:24 +0000
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>

All,

I think we should make a good effort of merging Diana's code into the
bioruby codebase. Even though I'm not completely familiar with
bioruby's phylo implementation, an effort like hers should be welcomed
with open arms.

If her code speeds things up so immensely, why don't we start a new
branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
A major new release is allowed to break free from the legacy code.

We definitely don't want Diana's efforts to be in vain.

jan.

2009/11/8 Naohisa Goto <ngoto at gen-info.osaka-u.ac.jp>:
> Hi Diana,
>
> I'm sorry that the changes cannot be accepted, because the
> modification of existing Bio::Tree methods breaks things.
> Bio::Tree does not want to have children/parent information
> in nodes. One of the reasons is that it is difficult to keep
> consistency when copying a tree. Nodes can be shared with two
> or more trees when copying a tree by using "dup" or "clone"
> method.
>
> Normally, tests for existing classes shold not be modified
> except when changing specification or the test's bug, because
> they guarantee specification of the class. Adding new tests
> are OK.
>
> If you really want nodes to have parent/children information
> in each node, please do so in only PhyloXML classes (though
> I'm negative).  In this case, the problem is that reading phyloxml
> data and write back again seems good, but it seems there are
> currently no way to convert Bio::Tree to PhyloXML. Now, it seems
> hard to convert Newick data to PhyloXML.
>
> Now, to prepare to include your PhyloXML code in BioRuby, I'm working
> on my branch. Some API changes will be made.
> http://github.com/ngoto/bioruby/tree/incoming
>
> Note that in your test code, argument order of assert_equal is wrong.
> I've already fixed in my branch.
> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
>
>> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>> track of Node::parent and Node::children nodes correctly.  Have I
>> forgotten anything?
>
> Changing root with tree.root=().
>
> --
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>
>> Hi all,
>>
>> So finally I have updated Bio::Tree and Bio::Node classes to improve
>> the phyloxml writer speed.
>>
>> * Added Bio::Node::parent and  Bio::Node::children (array of nodes) in
>> order to avoid calling Tree::parent(node) or Tree::children(node),
>> because those methods call breath first search on the underlying
>> graph, which makes PhyloXML writer and parser incredibly slow. In
>> contrast, Bio::Node::parent and Bio::Node::children keeps references
>> to the respective nodes.
>> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>> track of Node::parent and Node::children nodes correctly.  Have I
>> forgotten anything?
>> * Now for PhyloXML writer it takes less than 1 second instead of
>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
>> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
>>
>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>
>> I wrote unit tests for my changes and made sure my changes don't break
>> anything else. However, does anybody has code laying around that uses
>> Tree::parent and Tree::children methods so that I can test it more
>> thoroughly?
>>
>> Cheers,
>> Diana


From georgkam at gmail.com  Tue Nov 17 00:40:31 2009
From: georgkam at gmail.com (George Githinji)
Date: Tue, 17 Nov 2009 08:40:31 +0300
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 6
In-Reply-To: <mailman.15.1258390804.12843.bioruby@lists.open-bio.org>
References: <mailman.15.1258390804.12843.bioruby@lists.open-bio.org>
Message-ID: <55915f820911162140w592077f4o448d63e11b4300be@mail.gmail.com>

If Ruby itself is known to be slow compared to other interpreters, and
Diana's code speeds things up, then as a BioRuby user I would plead with
the developers to adopt her code, with its speed optimizations, in the
next release. The next release can only be better if the current code base
is overhauled and reviewed based on new developments like Diana's.

If Newick can be converted to a format which can then be converted to
PhyloXML, then conversion to Newick is not a problem. Otherwise I would
question the use of the Newick format if it cannot be inter-converted with
other file formats.




-- 
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/

From djaunzei at smith.edu  Tue Nov 17 09:52:59 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Tue, 17 Nov 2009 09:52:59 -0500
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com> 
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
Message-ID: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>

Thanks for the discussion. I see Naohisa's point that it is difficult to
keep consistency when copying a tree.

Right now the PhyloXML class inherits from the Bio::Tree class. Instead,
I could write a new general Bio::FamilyTree class (per Pjotr's
suggestion), which would be strictly a tree (I believe that Bio::Tree
allows a node to have 2 parents) and would have parent/child
information. It would not need an underlying general graph
implementation, which would make it simpler than Bio::Tree. Then
PhyloXML::Tree would inherit from Bio::FamilyTree.
This way the PhyloXML writer would probably be even faster because it
would not need to update the Bio::Pathway structure (which is under
Bio::Tree) every time a node or edge is added.
Additionally, I think BioRuby would benefit from a general
Bio::FamilyTree class. I recently heard a talk by a researcher who did
a phylogenetic analysis of musical rhythms.
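
A minimal sketch of what such a strictly tree-shaped class might look
like. Bio::FamilyTree is only a proposal at this point, so the class
name FamilyTreeSketch and all of its methods are hypothetical. Nodes
are identified by name, and a node can be attached only once, which
enforces the single-parent rule without any general graph underneath:

```ruby
class FamilyTreeSketch
  Node = Struct.new(:name, :parent, :children)

  attr_reader :root

  def initialize(root_name)
    @root = Node.new(root_name, nil, [])
    @index = { root_name => @root }   # name => node, for O(1) lookup
  end

  # Attaches a new child; refuses names that are already in the tree,
  # so no node can ever end up with two parents.
  def add_child(parent_name, child_name)
    parent = @index.fetch(parent_name)
    raise ArgumentError, "#{child_name} is already in the tree" if @index.key?(child_name)

    child = Node.new(child_name, parent, [])
    parent.children << child
    @index[child_name] = child
  end

  # Returns the parent's name, or nil for the root.
  def parent_of(name)
    parent = @index.fetch(name).parent
    parent && parent.name
  end

  def children_of(name)
    @index.fetch(name).children.map(&:name)
  end
end

tree = FamilyTreeSketch.new('Mollusca')
tree.add_child('Mollusca', 'Gastropoda')
tree.add_child('Mollusca', 'Bivalvia')
tree.add_child('Gastropoda', 'Helix')

puts tree.parent_of('Helix')        # prints Gastropoda
p tree.children_of('Mollusca')      # prints ["Gastropoda", "Bivalvia"]
```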

Also, I will write a method to convert from Newick to PhyloXML.
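
As a rough idea of what that conversion involves, here is a simplified
sketch. It is not the BioRuby implementation, and all function names
are made up for this example. It handles only plain labels and branch
lengths (no quoted labels, comments, or XML escaping) and writes only
the <clade>, <name> and <branch_length> elements, leaving out the
phyloXML namespace declaration:

```ruby
require 'strscan'

# Parses a simplified Newick string into nested hashes of the form
# { name:, length:, children: [...] }.
def parse_newick(text)
  scanner = StringScanner.new(text.strip)
  tree = parse_clade(scanner)
  raise ArgumentError, 'expected ";" at the end' unless scanner.scan(/;/)

  tree
end

def parse_clade(scanner)
  children = []
  if scanner.scan(/\(/)
    loop do
      children << parse_clade(scanner)
      break unless scanner.scan(/,/)
    end
    raise ArgumentError, 'unbalanced parentheses' unless scanner.scan(/\)/)
  end
  name = scanner.scan(/[^():,;]+/)                       # optional label
  length = scanner.scan(/:/) ? scanner.scan(/[-+0-9.eE]+/) : nil
  { name: name, length: length, children: children }
end

# Writes one clade and, recursively, its sub-clades.
def clade_to_xml(clade, indent)
  pad = ' ' * indent
  lines = ["#{pad}<clade>"]
  lines << "#{pad}  <name>#{clade[:name]}</name>" if clade[:name]
  lines << "#{pad}  <branch_length>#{clade[:length]}</branch_length>" if clade[:length]
  clade[:children].each { |child| lines << clade_to_xml(child, indent + 2) }
  lines << "#{pad}</clade>"
  lines.join("\n")
end

def newick_to_phyloxml(text)
  ['<phyloxml>',
   '  <phylogeny rooted="true">',
   clade_to_xml(parse_newick(text), 4),
   '  </phylogeny>',
   '</phyloxml>'].join("\n")
end

puts newick_to_phyloxml('(A:0.1,(B:0.2,C:0.3)inner:0.05)root;')
```

Whether rooted="true" is appropriate depends on the tree; a Newick
string alone does not say whether its root is meaningful.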

What do you think?

Cheers,
Diana

On Mon, Nov 16, 2009 at 5:11 AM, Jan Aerts <jan.aerts at gmail.com> wrote:
> All,
>
> I think we should make a good effort of merging Diana's code into the
> bioruby codebase. Even though I'm not completely familiar with
> bioruby's phylo implementation, an effort like hers should be welcomed
> with open arms.
>
> If her code speeds things up so immensely, why don't we start a new
> branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
> With a major new release things are allowed to be broken free from the
> legacy code.
>
> We definitely don't want Diana's efforts be in vain.
>
> jan.
>
> 2009/11/8 Naohisa Goto <ngoto at gen-info.osaka-u.ac.jp>:
>> Hi Diana,
>>
>> I'm sorry that the changes cannot be accepted, because the
>> modification of existing Bio::Tree methods breaks things.
>> Bio::Tree does not want to have children/parent information
>> in nodes. One of the reasons is that it is difficult to keep
>> consistency when copying a tree. Nodes can be shared with two
>> or more trees when copying a tree by using "dup" or "clone"
>> method.
>>
>> Normally, tests for existing classes shold not be modified
>> except when changing specification or the test's bug, because
>> they guarantee specification of the class. Adding new tests
>> are OK.
>>
>> If you really want nodes to have parent/children information
>> in each node, please do so in only PhyloXML classes (though
>> I'm negative).  In this case, the problem is that reading phyloxml
>> data and write back again seems good, but it seems there are
>> currently no way to convert Bio::Tree to PhyloXML. Now, it seems
>> hard to convert Newick data to PhyloXML.
>>
>> Now, to prepare to include your PhyloXML code in BioRuby, I'm working
>> on my branch. Some API changes will be made.
>> http://github.com/ngoto/bioruby/tree/incoming
>>
>> Note that in your test code, argument order of assert_equal is wrong.
>> I've already fixed in my branch.
>> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
>>
>>> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly.  Have I
>>> forgotten anything?
>>
>> Changing root with tree.root=().
>>
>> --
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>
>>
>>> Hi all,
>>>
>>> So finally I have updated Bio::Tree and Bio::Node classes to improve
>>> the phyloxml writer speed.
>>>
>>> * Added Bio::Node::parent and  Bio::Node::children (array of nodes) in
>>> order to avoid calling Tree::parent(node) or Tree::children(node),
>>> because those methods call breath first search on the underlying
>>> graph, which makes PhyloXML writer and parser incredibly slow. In
>>> contrast, Bio::Node::parent and Bio::Node::children keeps references
>>> to the respective nodes.
>>> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly.  Have I
>>> forgotten anything?
>>> * Now for PhyloXML writer it takes less than 1 second instead of
>>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
>>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
>>> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
>>>
>>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>>
>>> I wrote unit tests for my changes and made sure my changes don't break
>>> anything else. However, does anybody has code laying around that uses
>>> Tree::parent and Tree::children methods so that I can test it more
>>> thoroughly?
>>>
>>> Cheers,
>>> Diana


From ngoto at gen-info.osaka-u.ac.jp  Tue Nov 17 11:27:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 18 Nov 2009 01:27:46 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
Message-ID: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I've just committed a speed-up of Bio::Tree#children to my repository.
It keeps compatibility. As a trade-off for the speed-up, memory
consumption is slightly higher than with the previous code.
http://github.com/ngoto/bioruby
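
One general way to obtain this kind of speed/memory trade-off is to
cache the children of every node and to throw the cache away whenever
the tree changes. The sketch below illustrates that idea only; it is an
assumption for the sake of example and not necessarily what the commit
does. CachedTree is a hypothetical class, not Bio::Tree:

```ruby
class CachedTree
  def initialize(root)
    @root = root
    @adjacency = Hash.new { |hash, key| hash[key] = [] }
    @children_cache = nil
  end

  def add_edge(a, b)
    @adjacency[a] << b
    @adjacency[b] << a
    @children_cache = nil   # any structural change invalidates the cache
  end

  # The first call after a change costs one full traversal; later calls
  # are plain hash lookups. The public interface stays the same.
  def children(node)
    build_cache unless @children_cache
    @children_cache.fetch(node, [])
  end

  private

  # One breadth-first pass from the root records the children of every node.
  def build_cache
    @children_cache = { @root => [] }
    queue = [@root]
    until queue.empty?
      current = queue.shift
      @adjacency[current].each do |neighbour|
        next if @children_cache.key?(neighbour)

        @children_cache[neighbour] = []
        @children_cache[current] << neighbour
        queue << neighbour
      end
    end
  end
end

tree = CachedTree.new(:root)
tree.add_edge(:root, :a)
tree.add_edge(:a, :b)
p tree.children(:root)   # prints [:a]
p tree.children(:a)      # prints [:b]
```

The extra memory is the cache table itself: one array per node.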

To benchmark reading and writing big PhyloXML data, a new sample
script based on Diana's test_phyloxml_big.rb has been added
as sample/test_phyloxml_big.rb.

Running the new sample/test_phyloxml_big.rb on a machine
(Pentium D 3.40GHz, memory 4GB, running Debian GNU/Linux)
with http://github.com/ngoto/bioruby:
47.52user 0.93system 0:50.09elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+167550minor)pagefaults 0swaps

with http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
43.55user 1.00system 0:46.59elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+165151minor)pagefaults 0swaps

Although my new code is still ~10% slower than Diana's new code
(47.52 s vs. 43.55 s of user time), I think that is acceptable
because my code keeps compatibility.
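
For reference, the relative difference can be worked out from the two
timing lines above (user time and elapsed time as reported by "time");
percent_slower is just a helper name for this example:

```ruby
# How much slower the first timing is relative to the second, in percent.
def percent_slower(slow, fast)
  ((slow - fast) / fast * 100).round(1)
end

# user time: 47.52 s (ngoto/bioruby) vs. 43.55 s (tree_class branch)
puts percent_slower(47.52, 43.55)   # prints 9.1

# elapsed time: 50.09 s vs. 46.59 s
puts percent_slower(50.09, 46.59)   # prints 7.5
```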

I wrote Bio::Tree because I wanted to manipulate trees flexibly,
e.g. merging and splitting trees, or changing the root of a tree.
For that purpose, I chose not to store parent/children
in a node.

I also think the current Bio::Tree is not the best. One of its
weak points is that it is relatively heavy. The flexibility may
not be needed for parsers that only represent a fixed data structure.
A new class seems attractive for uses that cannot be covered by
the current Bio::Tree implementation.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 17 Nov 2009 09:52:59 -0500
Diana Jaunzeikare <djaunzei at smith.edu> wrote:

> Thanks for discussion. I see Naohisa's point that it is difficult to
> keep consistency when copying a tree.
> 
> Right now PhyloXML class inherits from Bio::Tree class. Instead, I
> could write a new general Bio::FamilyTree class (per Pjotr's
> suggestion), which would be strictly a tree (I believe that Bio::Tree
> allows for a node to have 2 parents) and would have parent/child
> information. Thus it would not need underlying general graph
> implementation, therefore making the implementation simpler than that
> of Bio::Tree. Then PhyloXML::Tree would inherit from Bio::FamilyTree.
> This way PhyloXML writer probably would be even faster because it
> would not need to update Bio::Pathway structure (which is under
> Bio::Tree) every time adding a node or edge.
> Additionally, I think BioRuby would benefit from general
> Bio::FamilyTree class. I recently heard a talk by researcher who did
> phylogenetic analysis of musical rhythms.
> 
> Also I will write method to convert from newick to PhyloXML.
> 
> What do you think?
> 
> Cheers,
> Diana

From tomoakin at kenroku.kanazawa-u.ac.jp  Tue Nov 17 19:24:34 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 18 Nov 2009 09:24:34 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <A11D33B2-3F03-4278-86DA-F7428356845E@kenroku.kanazawa-u.ac.jp>

Hi,

One point seems to be that a tree can be unrooted or rooted.
Perhaps Goto-san's Bio::Tree represents an unrooted tree (not
distinguishing parents and children),
while Diana's class is for rooted trees (distinguishing parents
and children).
If this is the point, Bio::RootedTree is a better name than
Bio::FamilyTree.
In general, a rooted tree should be easy to convert to an unrooted tree,
while converting an unrooted tree to a rooted tree requires
specifying the root.

A text representation like NEWICK always has a root, although
the tree can be interpreted as either rooted or unrooted.

It could be good to have distinct interfaces for rooted and unrooted
trees, to make users aware of the conceptual difference.
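
As a rough illustration of the asymmetry described above, here is a minimal
pure-Ruby sketch. It does not use Bio::Tree or any proposed class; the hash
layout and the method names unroot and reroot are hypothetical, chosen only
for this example.

```ruby
# Hypothetical representation: a rooted tree as { parent => [children] }.
rooted = { "root" => ["A", "X"], "X" => ["B", "C"] }

# Rooted -> unrooted: drop the direction and keep an undirected adjacency
# list. (The old root simply stays in the graph as an ordinary node.)
def unroot(children_of)
  adjacency = Hash.new { |hash, node| hash[node] = [] }
  children_of.each do |parent, children|
    children.each do |child|
      adjacency[parent] << child
      adjacency[child] << parent
    end
  end
  adjacency
end

# Unrooted -> rooted: impossible without extra information, so the caller
# has to name the root explicitly. Edges are then directed away from it.
def reroot(adjacency, root)
  children_of = {}
  visited = { root => true }
  queue = [root]
  until queue.empty?
    node = queue.shift
    kids = adjacency[node].reject { |neighbour| visited[neighbour] }
    kids.each { |kid| visited[kid] = true }
    children_of[node] = kids unless kids.empty?
    queue.concat(kids)
  end
  children_of
end

unrooted = unroot(rooted)
rerooted = reroot(unrooted, "X")
```

The first conversion needs no arguments beyond the tree itself, while the
second cannot be written without the root parameter.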
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Nov 18 19:33:32 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 19 Nov 2009 09:33:32 +0900
Subject: [BioRuby] Blast to Phylogeny
In-Reply-To: <4B045622.8040204@broadinstitute.org>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
	<A11D33B2-3F03-4278-86DA-F7428356845E@kenroku.kanazawa-u.ac.jp>
	<4B045622.8040204@broadinstitute.org>
Message-ID: <EE017307-BE83-456E-B08D-46CBC4CFDF66@kenroku.kanazawa-u.ac.jp>

Hi,

In general, to construct a phylogenetic tree from molecular sequence  
data,
you will collect the homologous sequences, perform multiple alignment,
identify the region that will be used for the reconstruction,
and then pass the data to an appropriate program to reconstruct the  
phylogeny.

If I have a BLAST output, I would parse that file with Bio::FlatFile and
extract the identifiers of the hit sequences, use the identifiers to  
collect
individual sequences and submit the sequences to mafft for multiple  
alignment.
Convert the alignment to nexus format and manually check with  
MacClade, and then
parse the edited nexus file to write the multiple alignment readable  
by the phylogenetic
analysis program. There are many options you can take at each step.

So, there are multiple ways, but not a single simple way. :(

Bioruby has support for multiple alignment programs like mafft,  
muscle, and clustalw.
For phylogenetic reconstruction, there is some support for phylip and  
paml
(I have not tried these features of the BioRuby library, though).
There are a number of programs for phylogenetic analysis other than  
phylip and paml.
A list compiled by J. Felsenstein is available at
http://evolution.genetics.washington.edu/phylip/software.html

An alignment in a format similar to phylip's will be accepted by most
programs.
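
The first step above (extracting the identifiers of the hit sequences) could
look roughly like the sketch below. It is plain Ruby rather than
Bio::FlatFile, and it assumes the BLAST run was saved in tab-separated
tabular form with the subject (hit) ID in the second column. The method name
hit_ids and the example lines are made up for illustration, and the example
rows are cut down to three columns for brevity.

```ruby
# Assumed input: BLAST tabular output, one hit per line, tab-separated,
# with the query ID in column 1 and the subject (hit) ID in column 2.
# Comment lines starting with "#" and blank lines are skipped.
def hit_ids(blast_tabular_text)
  blast_tabular_text.each_line
                    .reject { |line| line.start_with?("#") || line.strip.empty? }
                    .map { |line| line.split("\t")[1] }
                    .uniq
end

# Made-up example rows (truncated to three columns).
example = "q1\tsp|P1|A\t98.0\nq1\tsp|P2|B\t75.5\nq1\tsp|P1|A\t60.1\n"
ids = hit_ids(example)
puts ids
```

The resulting list of identifiers is what would then be used to fetch the
individual sequences for alignment.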
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2009/11/19, at 5:16, Sharvari Gujja wrote:

> Hi,
>
> I am trying to construct a phylogenetic tree from Blast  
> output...Could you please let me know if there is a way to do  
> this..I have also been looking at Bio::Tree documentation but it is  
> not clear if it accepts Blast file as input.
>
> Appreciate any help.
>
> Thanks
> Sharvari


From robert.citek at gmail.com  Thu Nov 19 15:06:22 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Thu, 19 Nov 2009 15:06:22 -0500
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <4145b6790911191206r53c86818m280e3a149f9293ec@mail.gmail.com>

Hello all,

I would like to create a custom BLAST scoring matrix that I can use
with NCBI's blastall.  For example, let's say I want to create a
modified BLOSUM62 matrix called BLOSUM62ar, where the A:R score is now
2 instead of -1.

Some questions that I have:

1) is this possible?
2) if it is, where can I find documentation which describes how to do this?
3) is the blast output different from a regular blast?
4) if it is different, does bio-ruby have blast parsers that can parse
the output?
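
Independently of whether blastall will accept it, the editing step itself
(changing A:R from -1 to 2 and keeping the matrix symmetric) can be sketched
in plain Ruby. This assumes the usual whitespace-separated matrix text
layout: a header row of residues followed by one labelled row per residue.
The three-residue excerpt and the method name set_symmetric_score are
illustrative only, and leading comment lines are not carried over to the
output.

```ruby
# A three-residue excerpt of BLOSUM62 in whitespace-separated layout.
EXCERPT = <<~MATRIX
     A  R  N
  A  4 -1 -2
  R -1  5  0
  N -2  0  6
MATRIX

# Return new matrix text with score(a, b) and score(b, a) set to `score`.
def set_symmetric_score(matrix_text, a, b, score)
  lines = matrix_text.lines.map(&:chomp)
  # The header is the first line that is neither a comment nor blank.
  header_index = lines.index { |l| !l.start_with?("#") && !l.strip.empty? }
  columns = lines[header_index].split
  rows = lines[(header_index + 1)..-1].map(&:split).reject(&:empty?)
  rows.each do |row|
    residue = row[0]
    row[columns.index(b) + 1] = score.to_s if residue == a
    row[columns.index(a) + 1] = score.to_s if residue == b
  end
  # Re-emit with every cell right-aligned in a three-character field.
  table = [[" "] + columns] + rows
  table.map { |cells| cells.map { |c| c.rjust(3) }.join }.join("\n") + "\n"
end

modified = set_symmetric_score(EXCERPT, "A", "R", 2)
puts modified
```

Both the A row and the R row are updated, since a scoring matrix that is not
symmetric would score an alignment differently depending on which sequence
is the query.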

Thanks in advance for any pointers and suggestions.

Regards,
- Robert

From georgkam at gmail.com  Sat Nov 21 03:58:53 2009
From: georgkam at gmail.com (George Githinji)
Date: Sat, 21 Nov 2009 11:58:53 +0300
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>

Hi Martin,
Thanks for bringing the topic to the list. Some time back I was also very
interested in custom matrices for NCBI BLAST.
Making custom matrices is possible. Check this out:
BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

However, making your matrices work with NCBI BLAST is slightly difficult, as
you need to recompile the BLAST program and incorporate your modifications. I
found this not so straightforward, for lack of good documentation.

I wonder whether there is someone who has implemented the BLAST algorithm in
Ruby. (The argument is usually that the C implementation is very optimized
and good, so why would one want to implement it in Ruby?) I would not
buy that argument for learning purposes, though.  The closest I came to a BLAST
algorithm is an implementation of it in Perl, in the book Genomic Perl by
Rex A. Dwyer. He also outlines how to create your own matrices, with code
listings in Perl.

Please ping me back if you get more resources. :)
George



-- 
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/

From robert.citek at gmail.com  Sun Nov 22 08:55:58 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Sun, 22 Nov 2009 08:55:58 -0500
Subject: [BioRuby] custom blast scoring matrix
In-Reply-To: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
References: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
Message-ID: <4145b6790911220555q410187fak9f8b1b66e4a0ddf2@mail.gmail.com>

On Sat, Nov 21, 2009 at 3:58 AM, George Githinji <georgkam at gmail.com> wrote:
> Thanks for bringing the topic on list. Sometimes back i was also very
> interested in custom matrices for NCBI blast.
> Making custom Matrices is possible. check this out
> BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

Thanks for the citation.  I'll have a look into that.

> However making your matrices work with NCBI blast is slightly difficult as
> you need to recompile the BLAST program and incoporate your modifications. I
> found this a little bit not so straighforward. Lack of good documentation.

That's unfortunate.  I've tried compiling NCBI blast a few times in
the past and don't ever recall having success with it, running into
the same issues you describe.  But it's been a while and maybe the
process has become easier.  I'll give it a whirl.

> I wonder whether there is someone who has implemented the BLAST algorithm in
> Ruby. (The argument is usually that the C implementation is very optimized
> and good, so why would one want to implement it in ruby?) though i would not
> buy that argument for learning purposes.  The closest i came to a BLAST
> algorithm is an implementation of it in Perl, in the book Genomic Perl by
> Rex A. Dwyer, He also outlines how to create your own matrices with code
> listings in perl.

Thanks.  I'll have a look at that as well.

> Please ping me back if you get more resources. :)

Will do.

Regards,
- Robert


From pjotr.public14 at thebird.nl  Thu Nov 26 08:08:30 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:08:30 +0100
Subject: [BioRuby] Ruby EMBOSS mapping (using Biolib)
Message-ID: <20091126130830.GA19003@thebird.nl>

Hi all,

The last year I have been working on C library mappings to Ruby. A
comparison of Bioruby against Biolib/EMBOSS six frame translation of a
C.elegans dataset shows the Ruby with EMBOSS version is about 30x
faster. On my (outdated) machine:

Bioruby version:

  22929 records 137574 times translated!
   real    9m30.952s
   user    8m42.877s
   sys     0m32.878s

Biolib version:

  22929 records 137574 times translated!
   real    0m20.306s
   user    0m15.997s
   sys     0m1.344s

This is including IO - which is handled by Ruby. 

The Bioruby code reads:

  nt = FastaReader.new(fn)
  nt.each { | rec |
      seq = Bio::Sequence::NA.new(rec.seq)
      [-3,-2,-1,1,2,3].each do | frame |
        print "> ",rec.id," ",frame.to_s,"\n"
        print seq.translate(frame),"\n"
      end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

The Biolib code reads

  nt = FastaReader.new(fn)
  trnTable = Biolib::Emboss.ajTrnNewI(1);
  nt.each { | rec |
      ajpseq   = Biolib::Emboss.ajSeqNewNameC(rec.seq,"Test sequence")
      [-3,-2,-1,1,2,3].each do | frame |
        ajpseqt  = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
        aa       = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
        print "> ",rec.id," ",frame.to_s,"\n"
        print aa,"\n"
      end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"
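
For readers without BioRuby or Biolib installed, the six-frame loop can be
sketched in plain Ruby. This is not how either library implements
translation; it assumes the standard genetic code (table 1), and it takes
negative frames to be frames 1 to 3 of the reverse complement, which may not
match the numbering of every tool. All names below are made up for the
example.

```ruby
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = %w[T C A G].freeze
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {}
BASES.product(BASES, BASES).each_with_index do |codon, i|
  CODON_TABLE[codon.join] = AMINO[i]
end

def reverse_complement(dna)
  dna.upcase.reverse.tr("ACGT", "TGCA")
end

# frame is one of 1, 2, 3 (forward) or -1, -2, -3 (reverse complement).
# Codons containing anything other than A, C, G, T become "X";
# trailing bases that do not fill a codon are ignored.
def translate(dna, frame)
  strand = frame > 0 ? dna.upcase : reverse_complement(dna)
  (strand[(frame.abs - 1)..-1] || "")
    .scan(/.{3}/)
    .map { |codon| CODON_TABLE.fetch(codon, "X") }
    .join
end

def six_frames(dna)
  [-3, -2, -1, 1, 2, 3].map { |frame| [frame, translate(dna, frame)] }.to_h
end

six_frames("ATGGCCTAA").each do |frame, protein|
  puts "> example #{frame}"
  puts protein
end
```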

A write up of the mapping effort is at:

  http://biolib.open-bio.org/wiki/Mapping_EMBOSS


From pjotr.public14 at thebird.nl  Thu Nov 26 08:44:27 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:44:27 +0100
Subject: [BioRuby] Announcing BigBio project for Ruby
Message-ID: <20091126134427.GA20660@thebird.nl>

BigBio = BIG DATA computing (for Ruby)

BigBio is an initiative to create high performance libraries for big data
computing in biology - initially for the Ruby language. 

The Ruby version of BigBio uses BioRuby, when sensible, but provides an
interface with a different design. Also, unlike BioRuby which aims to be pure
Ruby, it uses BioLib C/C++ functions for increased performance and reduced
memory consumption.

The first module is an (indexed) FastaReader which does not load the
full FASTA file in memory. 
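
The general idea of an indexed reader can be sketched in a few lines of
plain Ruby: remember the byte offset of each header line, then seek to it
when a record is requested. This is only an illustration of the technique,
not BigBio's actual implementation; the method names build_fasta_index and
fetch_sequence are made up, and a StringIO stands in for a file on disk.

```ruby
require "stringio"

# Scan the input once, recording the byte offset of each ">" header line,
# keyed by the first word of the header.
def build_fasta_index(io)
  index = {}
  io.rewind
  offset = io.pos
  while (line = io.gets)
    index[line[1..-1].split.first] = offset if line.start_with?(">")
    offset = io.pos
  end
  index
end

# Jump straight to one record and read only its sequence lines.
def fetch_sequence(io, index, id)
  io.seek(index.fetch(id))
  io.gets # skip the header line
  sequence = String.new
  while (line = io.gets) && !line.start_with?(">")
    sequence << line.chomp
  end
  sequence
end

io = StringIO.new(">s1 first\nACGT\nGGCC\n>s2\nTTTT\n")
index = build_fasta_index(io)
puts fetch_sequence(io, index, "s2")
```

Only the index (one integer per record) is held in memory; the sequence
data stays in the file until it is asked for.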

http://github.com/pjotrp/bigbio

Pj.

From jan.aerts at gmail.com  Thu Nov 26 08:44:58 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:44:58 +0000
Subject: [BioRuby] VCF
Message-ID: <4c7507a70911260544j4ba5f089y38c76d4f48131258@mail.gmail.com>

Is anyone working on a VCF (Variant Call Format) parser in bioruby?
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

From jan.aerts at gmail.com  Thu Nov 26 08:46:52 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:46:52 +0000
Subject: [BioRuby] Announcing BigBio project for Ruby
In-Reply-To: <20091126134427.GA20660@thebird.nl>
References: <20091126134427.GA20660@thebird.nl>
Message-ID: <4c7507a70911260546w45839e7fra4a2565a66bc47ff@mail.gmail.com>

Interesting... Planning to incorporate SAM/BAM alignment formats for
nextgen sequences?

jan.

2009/11/26 Pjotr Prins <pjotr.public14 at thebird.nl>:
> BigBio = BIG DATA computing (for Ruby)
>
> BigBio is an initiative to create high performance libraries for big data
> computing in biology - initially for the Ruby language.
>
> The Ruby version of BigBio uses BioRuby, when sensible, but provides an
> interface with a different design. Also, unlike BioRuby which aims to be pure
> Ruby, it uses BioLib C/C++ functions for increased performance and reduced
> memory consumption.
>
> The first module is an (indexed) FastaReader which does not load the
> full FASTA file in memory.
>
> http://github.com/pjotrp/bigbio
>
> Pj.
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From jan.aerts at gmail.com  Thu Nov 26 08:52:16 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:52:16 +0000
Subject: [BioRuby] Bio::DB::Sam
Message-ID: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>

And another parser that probably should be added to bioruby: something
to interact with SAM/BAM files (which contain mapping positions for
short reads). More info at samtools.sourceforge.net

Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
go for something similar?
http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Is anyone already working on this?
jan.

From pjotr.public14 at thebird.nl  Thu Nov 26 09:17:03 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 15:17:03 +0100
Subject: [BioRuby] Bio::DB::Sam
In-Reply-To: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
References: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
Message-ID: <20091126141703.GA21032@thebird.nl>

On Thu, Nov 26, 2009 at 01:52:16PM +0000, Jan Aerts wrote:
> And another parser that probably should be added to bioruby: something
> to interact with SAM/BAM files (which contain mapping positions for
> short reads). More info at samtools.sourceforge.net

by the looks of it - it should be relatively easy with SWIG - and
therefore Biolib.

> Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
> go for something similar?
> http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Wow, this guy is hard core! Doing this with PerlXS takes a *lot* of
effort. XS is sooooo nineties ;-)

> Is anyone already working on this?

I am happy to write a SWIG mapper. If someone really cares to use
it and will write the higher-level Ruby interface (nice OOP class
representation). 

I have been told Bioruby is pure Ruby - so this will not fit in.

Pj.

From biopython at maubp.freeserve.co.uk  Thu Nov 26 11:02:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Nov 2009 16:02:50 +0000
Subject: [BioRuby] Fwd: [DAS] DAS workshop 7th-9th April 2010
In-Reply-To: <F30A9ED7-41E9-4833-A094-FDF0893E0F92@sanger.ac.uk>
References: <F30A9ED7-41E9-4833-A094-FDF0893E0F92@sanger.ac.uk>
Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com>

This might be of interest to some of you.

Peter

---------- Forwarded message ----------
From: Jonathan Warren <jw12 at sanger.ac.uk>
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: das at biodas.org, das_registry_announce at sanger.ac.uk, biojava-dev
<biojava-dev at biojava.org>, BioJava <biojava-l at biojava.org>, BioPerl
<bioperl-l at lists.open-bio.org>, all at sanger.ac.uk, all at ebi.ac.uk,
ensembldev <ensembl-dev at ebi.ac.uk>


We are considering running a Distributed Annotation System workshop
here at the Sanger/EBI in the UK subject to decent demand.
The workshop will be held from Wednesday 7th-Friday 9th April 2010. If
you would be interested in attending either to present or just take
part
then please email me jw12 at sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st
day for beginners, 2nd for both beginners and advanced users, 3rd day
for advanced), information for which can be found here:
http://www.dasregistry.org/course.jsp

If you would like to present then please send a short summary of what
you would like to talk about.

Thanks

Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
DAS mailing list
DAS at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das

From josejotero at gmail.com  Fri Nov 27 21:55:38 2009
From: josejotero at gmail.com (Jose Otero)
Date: Fri, 27 Nov 2009 18:55:38 -0800
Subject: [BioRuby] Bio::GenBank
Message-ID: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>

Hello all,
I'm new to BioRuby and I am trying to adapt the Bio::GenBank class to store
information for my plasmid database.
Question 1:  Does anybody know how to insert a nucleic acid sequence as the
value of 'sequence' in the @data object?
Placing information as Bio::Feature::Qualifier objects is easy, as is
inserting Bio::Locus information.  But I can't figure out how to insert the
sequence data.
Question 2:  Has anybody ever changed the data in a Bio::GenBank object and
saved the altered file?  This would be very interesting for my plasmid
database.

Thanks for the help.
JO

From ngoto at gen-info.osaka-u.ac.jp  Sat Nov 28 04:00:01 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 28 Nov 2009 18:00:01 +0900
Subject: [BioRuby] Bio::GenBank
In-Reply-To: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>
References: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>
Message-ID: <20091128090002.372041CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hello Jose,

On Fri, 27 Nov 2009 18:55:38 -0800
Jose Otero <josejotero at gmail.com> wrote:

> Hello all,
> I'm new to BioRUby and I am trying to adapt  the BioGenbank class to store
> information of my plasmid database.
> Question 1:  Does anybody know how to insert a nucleic acid sequence as the
> value to 'sequence' in the @data object?
> Placing in inoformation as Bio::Feature::Qualifier objects is easy, as is
> inserting Bio::Locus information.  But I can't figure how to insert the
> sequence data.

Once an object of the Bio::GenBank class is created, each data stored
in the object is intended to be read-only, though modification is not
explicitly prohibited. This is because the class is designed for
efficient parsing of the GenBank formatted text, and it is technically
not easy to achieve both efficient parsing and flexible modification.
(This also applies to most parser classes, e.g. Bio::EMBL, Bio::SPTR,
etc.)

In your case, using Bio::Sequence seems the best way. Once a Bio::GenBank
object has been converted to a Bio::Sequence object, the result can be
freely modified.

  # Assume str contains GenBank formatted text as String.
  #
  # Creating a new Bio::GenBank object.
  gb = Bio::GenBank.new(str)

  # Converting to Bio::Sequence object
  s = gb.to_biosequence

  # Modifying the sequence.
  #
  # Note that other attributes, such as features and references
  # (which depend on locations on the sequence) are kept unchanged.
  # Relocating the features, references, etc. is left to the
  # user.
  # 
  s.seq = 'atgc' * 10 + s.seq

  # Text formatting as the GenBank format.
  puts s.output(:genbank)

Creating a new Bio::Sequence object from scratch, giving definition,
accessions, keywords, references, features, etc., and getting
GenBank-formatted text can also be done.

> Question 2:  Has anybody ever changed the data from a BioGenbank object and
> save the altered file?  This would be very interesting for my plasmid
> database.

As described above, Bio::Sequence#output can be used. The method returns
formatted text as String, and you can easily write it to a file.

> Thanks for the help.
> JO


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From yannick.wurm at unil.ch  Tue Nov  3 14:11:52 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 15:11:52 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
Message-ID: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>

Hi,

this is a more general ruby question, but since my application is  
bioinformatics, I'm posting it here.

Just wanted to prepend a few characters in front of FASTA identifiers.


$ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
	real	0m20.379s
	user	0m0.741s
	sys	0m0.168s


While the perl equivalent is one heck of a lot faster!!!


$ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
	real	0m2.165s
	user	0m0.266s
	sys	0m0.146s


Is there any hope for ruby?

Thanks,
yannick


--------------------------------------------
           yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
    http://www.unil.ch/dee/page28685_fr.html


From yannick.wurm at unil.ch  Tue Nov  3 22:49:12 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Tue, 3 Nov 2009 23:49:12 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
Message-ID: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>

Hi Mike,

thanks for your response. I'm running:
ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
Starting to age, but on a production machine I'd rather stay with what  
works than risk breaking things by upgrading them.

the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

My reasons for preferring ruby are the same as yours. But a 5 to 10x  
speed difference is expensive  (I'm calling the one-liner below about  
10,000 times from a larger ruby script - YES, it's ugly, but  
refactoring the script to avoid calling that type of oneliner would be  
a pain since I use 10,000 different prefixes).

I have the feeling that it's mostly ruby's startup time. Running
the ruby one-liner on a fasta of 40,000 sequences takes 20 seconds;
running it on a fasta of only 10 lines still takes 13 seconds!!

I found some generic benchmarks indicating that ruby is generally only  
a bit slower than perl
http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

So maybe I can keep using ruby - just avoiding one-liners!
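
One way to avoid the one-liner, sketched here under the assumption that the
calling script is itself Ruby, is to do the prefixing in-process so that no
new interpreter is started per prefix. The method name prefix_fasta_ids and
the two example prefixes are hypothetical, and StringIO objects stand in for
real files.

```ruby
require "stringio"

# Copy FASTA text from input to output, inserting prefix right after the
# ">" of every header line. Sequence lines are passed through untouched.
def prefix_fasta_ids(input, output, prefix)
  input.each_line do |line|
    output << (line.start_with?(">") ? ">#{prefix}#{line[1..-1]}" : line)
  end
  output
end

fasta = ">seq1\nACGT\n>seq2\nGGCC\n"

# Many prefixes, one Ruby process: loop instead of shelling out each time.
renamed = %w[LibA_ LibB_].map do |prefix|
  prefix_fasta_ids(StringIO.new(fasta), StringIO.new, prefix).string
end
puts renamed.first
```

With real files, the two StringIO arguments would be replaced by File
objects opened for reading and writing.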

Best,
yannick

On 3 Nov 2009, at 22:26, Michael Barton wrote:

> What version of Ruby are you using?
> Ruby is an expressive language rather than a "fast" language.
> I use Ruby because it's easier to read and maintain my programs, rather
> than because of how fast it is.
>
> If you are interested purely in speed you could write in C?
> What are the benchmarks for something like this?
>
> time sed 's/^>/>MyPrefix/' clustering/dirsForAssembly/singlets.fasta > abc
>
> Mike
>
> 2009/11/3 Yannick Wurm <yannick.wurm at unil.ch>:
>> Hi,
>>
>> this is a more general ruby question, but since my application is
>> bioinformatics, I'm posting it here.
>>
>> Just wanted to prepend a few characters in front of FASTA  
>> identifiers.
>>
>>
>> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>>        real    0m20.379s
>>        user    0m0.741s
>>        sys     0m0.168s
>>
>>
>> While the perl equivalent is one heck of a lot faster!!!
>>
>>
>> $time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e
>> 's/^>/>MyPrefix/g' > ab
>>        real    0m2.165s
>>        user    0m0.266s
>>        sys     0m0.146s
>>
>>
>> Is there any hope for ruby?
>>
>> Thanks,
>> yannick
>>
>>
>> --------------------------------------------
>>          yannick . wurm @ unil . ch
>> Ant Genomics, Ecology & Evolution @ Lausanne
>>   http://www.unil.ch/dee/page28685_fr.html
>>
>>


From juanfc at uma.es  Tue Nov  3 22:44:10 2009
From: juanfc at uma.es (Juan Falgueras)
Date: Tue, 3 Nov 2009 23:44:10 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <87CAA48B-151F-41C3-9DF5-23C4B43BDFD0@uma.es>


	Hi, have you tried it with Ruby 1.9?


On 03/11/2009, at 15:11, Yannick Wurm wrote:

> Hi,
>
> this is a more general ruby question, but since my application is  
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
> 	real	0m20.379s
> 	user	0m0.741s
> 	sys	0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
> 	real	0m2.165s
> 	user	0m0.266s
> 	sys	0m0.146s
>
>
> Is there any hope for ruby?
>
> Thanks,
> yannick
>
>
> --------------------------------------------
>          yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
>   http://www.unil.ch/dee/page28685_fr.html
>
>


From trevor at corevx.com  Tue Nov  3 23:18:50 2009
From: trevor at corevx.com (Trevor Wennblom)
Date: Tue, 3 Nov 2009 17:18:50 -0600
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com>
	<626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID: <CE266AD0-6CAD-40CE-93AD-53B86C35F8D8@corevx.com>


On Nov 3, 2009, at 4:49 PM, Yannick Wurm wrote:

> I found some generic benchmarks indicating that ruby is generally  
> only a bit slower than perl
> http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ruby&lang2=perl

http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=yarv&lang2=perl&box=1


From robert.citek at gmail.com  Wed Nov  4 01:32:12 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Tue, 3 Nov 2009 20:32:12 -0500
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <4145b6790911031732m731d0b09o199041ab0feb610c@mail.gmail.com>

On Tue, Nov 3, 2009 at 9:11 AM, Yannick Wurm <yannick.wurm at unil.ch> wrote:
> this is a more general ruby question, but since my application is
> bioinformatics, I'm posting it here.
>
> Just wanted to prepend a few characters in front of FASTA identifiers.
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | ruby -pe "gsub(/^>/, '>MyPrefix')" > abc
>        real    0m20.379s
>        user    0m0.741s
>        sys     0m0.168s
>
>
> While the perl equivalent is one heck of a lot faster!!!
>
>
> $ time cat clustering/dirsForAssembly/singlets.fasta | perl -p -i -e 's/^>/>MyPrefix/g' > ab
>        real    0m2.165s
>        user    0m0.266s
>        sys     0m0.146s
>
>
> Is there any hope for ruby?

I get a factor of about three on a 10,000,000 line FASTA file:

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | ruby -pe "gsub(/^>/,
'>MyPrefix')" > /dev/null
real 42.99
user 43.39
sys 0.63

$ time -p yes ">foo"$'\n'"bar" | head -10000000 | perl -pe
's/^>/>MyPrefix/g' > /dev/null
real 15.89
user 16.33
sys 0.26

This is with perl 5.8.8 and ruby 1.8.6 on a dual 1.6 GHz CPU with 512 MB RAM.

Notice your user and system times are less than a factor of three.
It's only the real time that is 10x, which suggests that ruby is
waiting on other processes, e.g. disk reads.
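
The same distinction between real (wall-clock) time and user/system CPU time
can be seen from inside Ruby with the standard Benchmark module. In this
small sketch a sleep stands in for time spent waiting on something external
such as a disk; the 0.2 second figure is arbitrary.

```ruby
require "benchmark"

# sleep consumes wall-clock time but almost no CPU time, much like a
# process that is blocked waiting for I/O.
timing = Benchmark.measure { sleep 0.2 }

puts format("real %.2f  user %.2f  sys %.2f",
            timing.real, timing.utime, timing.stime)
```

A large gap between real and user + sys, as in the timings quoted above,
points at waiting rather than at the interpreter doing slow computation.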

Regards,
- Robert


From pjotr.public14 at thebird.nl  Wed Nov  4 10:22:45 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 4 Nov 2009 11:22:45 +0100
Subject: [BioRuby] Ruby speed
In-Reply-To: <22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org>
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
Message-ID: <20091104102245.GA13264@thebird.nl>

On Tue, Nov 03, 2009 at 03:11:52PM +0100, Yannick Wurm wrote:
> Is there any hope for ruby?

I guess you mean this tongue in cheek. However, it is dangerous as it
may turn off users looking to start with Ruby or Perl. So let me state
that I think there is plenty of hope for Ruby. You are talking about the
execution speed of 'simple' one-liners. For complex programming, Ruby
usually outpaces Perl in practice: particularly in the speed of getting
things done, but a cleaner way of programming also helps create better
code. The end result will often be faster. And the third gain is in the
code maintenance cycle. I am talking from experience here. I have written
a lot of code in both languages (and Python too).

Perl6 is getting interesting. The syntax is much cleaned up, proper
OOP, and (what I like) strong functional programming support. But its
execution speed is not even close to Ruby's now. I have heard people
joke that Ruby is what Perl6 was meant to be.

Anyway you can see where the Perl folks are heading.

Pj.

P.S. What is there to stop you from using both languages?


From mail at michaelbarton.me.uk  Wed Nov  4 11:24:36 2009
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Wed, 4 Nov 2009 11:24:36 +0000
Subject: [BioRuby] Ruby speed
In-Reply-To: <626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
References: <mailman.11.1256227205.8317.bioruby@lists.open-bio.org> 
	<22656BBF-3EE1-44DD-BA88-6BC1FA3D4E53@unil.ch>
	<c27b73c0911031326q7909041cx715927ccc4d487ad@mail.gmail.com> 
	<626B2E8A-8E3B-4643-8104-DC1143121523@unil.ch>
Message-ID: <c27b73c0911040324s66de2a97jde807bcb9fea78b@mail.gmail.com>

2009/11/3 Yannick Wurm <yannick.wurm at unil.ch>:
> thanks for your response. I'm running:
> ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux]
> Starting to age, but on a production machine I'd rather stay with what works
> than risk breaking things by upgrading them.

I think Ruby 1.9 is now the official Ruby release, so you might want
to start trying out using this version, for example Rails 3.0 won't
work with Ruby 1.8.6 anymore. I've tried Ruby 1.9 a bit myself and the
requirements for compatibility are relatively small. If you still
prefer to use 1.8, you could try using REE
(http://www.rubyenterpriseedition.com/) which has a few patches to
improve performance over vanilla 1.8. You could try using
ruby_switcher which makes trying different ruby versions a bit less
painful - http://bit.ly/1kY1Qk

> the command sed 's/^>/>MyPrefix/' is indeed 30% faster than perl :)

Could you just try calling out to sed then?

> I have the feeling that it's mostly ruby's startup time. Running the
> ruby one-liner on a fasta of 40,000 sequences takes 20 seconds; running it
> on a fasta of only 10 lines still takes 13 seconds!!

You might also want to try experimenting with gsub! instead of gsub as
the former does destructive in place substitution while the latter
creates an extra object with the substituted text. This extra object
creation might also slow performance.
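
The difference between the two String methods can be checked directly. This
is a small sketch of the behaviour only, not a benchmark, and the variable
names are arbitrary.

```ruby
line = ">seq1\n".dup

# gsub returns a new String and leaves the receiver untouched.
copy = line.gsub(/^>/, ">MyPrefix")
copy_is_new_object = !copy.equal?(line)
line_unchanged = (line == ">seq1\n")

# gsub! modifies the receiver in place and returns that same object ...
result = line.gsub!(/^>/, ">MyPrefix")
modified_in_place = result.equal?(line) && line == ">MyPrefixseq1\n"

# ... or nil when nothing matched, which matters if the return value is used.
no_match = "ACGT\n".dup.gsub!(/^>/, ">MyPrefix")

puts copy_is_new_object, line_unchanged, modified_in_place, no_match.inspect
```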

Cheers Mike


From diapriid at gmail.com  Wed Nov  4 18:29:13 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 13:29:13 -0500
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
Message-ID: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>

Hi all,

As far as I can tell there is not yet a straightforward way to use
Bio::Blast with the NCBI portal? I've seen this on the wiki: "Add
remote BLAST search sites", and understand the basic concept, but
don't have time at present to work on this.  Is anyone actively
working on this? (just FYI see
http://github.com/kwicher/ruby-blast-at-ncbi).

I ask in part because I'm struggling to get a basic remote blast working:

seq = Bio::Sequence::NA.new('GTCACAAAATCATGGTTTTGCGGTTAATGCTAATGATTTGCCAGCTGATTGGGAACCATTATTTACAAATGCGAACGACAATACAAATGAAGGAATTGTACACAAAACACATCCATTCTTTAGTGTACAATTTCATCCCGAACACACAGCCGGTCCAGAAGATTTAGAAATCTTATTTGATGTCTTTCTGGATGGAGTAAAAGCATTTAAAAATAAGGAAAAGTTCAYCATGAARGATAAATTGATCGAAAAATTGACTTACACGCCGGATGTACCCGTTTGCACTGAAAAACCTAAAAAGATATTGATTTTAGGTTCAGGCGGTTTATCCATAGGYCAAGCAGGCGAATTTGATTATTCCGGATCTCAGGCTATCAAGGCTCTTAAAGAAGAAAAAATACAAACGGTGYTAATAAATCCAAATATTGCAACGGTTCARACATCAAAAGGCCTTGCGGACAAAGTTTACTTCCTACCCATTACACCGGATTACGTTGAACAGGTTATAAAAGCCGAGCGACCTGATGGTGTGCTTTTAACTTTTGGCGGACAAACAGCTTTGAATTGTGGAATTGAATTAGAAAAAACTAAAGTGTTTCAACGATTCGGTGTTAAAGTGTTGGGTACRCCGATACAATCAATTATTGAAACTGAAGATAGAAAAATATTTTCGGATCGAGTACACGAAATCGGAGAAAAAGTAGCGCCGTCTGCCGCAGTTTATTCGGTGCAAGAAGCTCTAGATGCCGCTGAAATTCTTGGTTATCCCGTTATGGCTCGAGCTGCATTTTCATTAGGTGGACTAGGTTCTGGTTTTGCAAATAATATTGATGAATTAAAACATCTTGCACAACAGGCTCTTGCGCATTCCAACCAGTTAATCATTGATAAATCGCTTAAAGGTTGGAAGGAAGTTGAATACGAGGTCGTTCGTGATGCATATGACAATTGTATTACAGTTTGTAATATGGAAAATGTAGATCCACTAGGAATTCATACAGGGGAGAGTATAGTAGTGGCACCGTCACAAACTCTCTCCAACAAGGAATATAATATGTTGCGTACTACAGCAATTAAAGTGATTCGGCATTTTGGCGTCGTCGGTGAATGTAATATACAATATGCCTTAAATCCACATTCYGAGCAATACTATATAATTGAAGTTAATGCTAGGTTATCGAGGAGTTCGGCACTAGCTAGTAAAGCGACAGGCTATCCATTAGCATACGTTGCGGCTAAACTAGCACTCGGTATCGCTTTACCTGATATTAAAAATTCGGTAACTGGAGTTACCACCGCCTGTTTTGAGCCAAGTTTAGATTACTGTGTGGTAAAAATTCCACGATGGGATTTAGCAAAATTTGTTCGCGTTTCAAAAAATATTGGAAGCTCTATGAAAAGTGTAGGTGAGGTCATGGCAATCGGCCGCCGATTTGAAGAAGCGTTCCAAAAA')

blast_factory = Bio::Blast.new('blastn','nr-nt', '', 'genomenet')
foo = blast_factory.query(seq)

... freezes; when I press Ctrl-C I get:

from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `call'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `sleep'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast/genomenet.rb:224:in `exec_genomenet'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `__send__'
from /Library/Ruby/Gems/1.8/gems/bio-1.3.1.5000/lib/bio/appl/blast.rb:368:in `query'
from (irb):25

any glaring problems with this? Is it just waiting for the results of
the remote query?   I noticed that the genomenet blasts are much
slower than NCBI in general (I'm in the US).

thanks,
Matt


From diapriid at gmail.com  Wed Nov  4 19:57:11 2009
From: diapriid at gmail.com (Matt)
Date: Wed, 4 Nov 2009 14:57:11 -0500
Subject: [BioRuby] (previous answered in part) timeout/long time
Message-ID: <19d6b9770911041157l1556ac89s4e8c62ad2e20460d@mail.gmail.com>

Aha- my queries *are* working, just taking a very long time to finish.

Can I limit to say top 10 results?

cheers,
Matt


From yannick.wurm at unil.ch  Wed Nov  4 19:56:13 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Wed, 4 Nov 2009 20:56:13 +0100
Subject: [BioRuby] Ruby speed
Message-ID: <81E8B742-2508-40DF-8E81-07F1C8126839@unil.ch>

> Notice your user and system times are less than a factor of three.
> It's only the real time that is 10x, which suggests that ruby is
> waiting on other processes, e.g. disk reads.

Great point Robert - I hadn't seen that. My guess is that the
difference is due to the fact that ruby is only installed in my
networked (sfs) home dir on the linux server, not on the local machine
like perl is. Gotta get the sysadmins to install ruby :)
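
The same real-versus-CPU comparison can be made from inside Ruby with
the standard Benchmark module. This is only a sketch: the helper name
is made up, and sleep stands in for time spent waiting on something
like a networked disk.

```ruby
require 'benchmark'

# Hypothetical helper: returns [wall-clock seconds, CPU seconds] for a block.
def real_and_cpu_time
  t = Benchmark.measure { yield }
  [t.real, t.utime + t.stime]
end

real, cpu = real_and_cpu_time { sleep 0.2 }   # sleep stands in for slow I/O
printf("real %.3fs  cpu %.3fs\n", real, cpu)
```

When real is much larger than the CPU figure, the process spent most
of its time waiting rather than computing.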

cheers!
yannick


From email2ants at gmail.com  Thu Nov  5 16:22:12 2009
From: email2ants at gmail.com (Anthony Underwood)
Date: Thu, 5 Nov 2009 16:22:12 +0000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <86C24368-84E1-4A43-ABBD-A26B998159B2@gmail.com>

Hi Matt

I have done a bit of work to get NCBI blast working within bioruby.


See this gist on github http://gist.github.com/227160

ncbi_blast.rb defines an exec_ncbi class for the Blast class in bioruby
The script ncbi_blast_test.rb illustrates its usage but uses a few  
functions defined in the blast_functions.rb file


essentially the following should work

require 'rubygems'
require 'bio'
require 'ncbi_blast'
# If you are working from behind a proxy, set this and enter the IP and
# port number as appropriate:
ENV['http_proxy'] = "http://proxy_server_ip:port_number"

sequence = "ATGAATCCAAATCAGAAAATAATAA........"

factory = Bio::Blast.remote('blastn', 'nr', '', 'ncbi')
blast_report = factory.query(sequence)


blast_report will be a Bio::Blast::Report object which can be parsed  
as described in the bioruby api

The hit definitions are fairly uninformative, containing just the
accessions. This is why I then have to fetch the data from EMBL as
follows

     accession = definition.split("|")[3]
     accession.sub!(/\..+$/, "") # remove version number
     server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
     embl_text = server.fetch('embl', accession)
     embl_object = Bio::EMBL.new(embl_text)
     puts embl_object.description
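
The accession handling in the first two lines above can be tried on
its own, without any network access. A minimal sketch: the helper name
and the example definition string are made up, and it assumes the
accession sits in the fourth |-separated field as in the snippet above.

```ruby
# Hypothetical helper: pulls the accession (without its version suffix)
# out of a hit definition such as "gi|...|gb|ACCESSION.VERSION|".
# Returns nil when the definition has fewer than four |-separated fields.
def accession_from_definition(definition)
  accession = definition.split("|")[3]
  return nil if accession.nil?
  accession.sub(/\..+$/, "")   # remove version number
end

# The definition string below is invented for illustration.
puts accession_from_definition("gi|0000000|gb|AB000000.1|")   # AB000000
```

Returning nil for an unexpected definition avoids a NoMethodError
further down when a hit does not follow this layout.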


This is still a work in progress but it worked OK for me. Hope it is  
of some use to you.


Anthony


From kenglish at gmail.com  Thu Nov  5 16:43:31 2009
From: kenglish at gmail.com (Kevin English)
Date: Thu, 5 Nov 2009 06:43:31 -1000
Subject: [BioRuby] basic remote query and plans for NCBI Bio:Blast hook?
In-Reply-To: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
References: <19d6b9770911041029w1142e68ag1d739675c75514f9@mail.gmail.com>
Message-ID: <d82e7f560911050843v508f36c5x504cc8924b62578f@mail.gmail.com>

Have you considered downloading the nr-nt databases and running local queries?

I played with the Blast Remote for a while but determined it was too
slow for our workload...

Kevin


From yannick.wurm at unil.ch  Thu Nov  5 20:06:33 2009
From: yannick.wurm at unil.ch (Yannick Wurm)
Date: Thu, 5 Nov 2009 21:06:33 +0100
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 1
In-Reply-To: <mailman.19.1257354004.14930.bioruby@lists.open-bio.org>
References: <mailman.19.1257354004.14930.bioruby@lists.open-bio.org>
Message-ID: <D4414522-8174-4599-9F15-350E5FC63D99@unil.ch>

On 4 Nov 2009, at 18:00, bioruby-request at lists.open-bio.org wrote:

> I guess you mean this tongue in cheek. However, it is dangerous as it
> may turn off users looking to start with Ruby or Perl. So let me state
> I think there is plenty of hope for Ruby. You are talking execution
> speed of 'simple' oneliners. For complex programming Ruby outspeeds
> Perl, usually in practise. Particularly the speed of getting things
> done, but also a cleaner way of programming helps create better code.
> The end result will often be faster. And the third gain is in the code
> maintenance cycle. I am talking from experience here. I have written
> a lot of code in both languages (and Python too).

Those are excellent points, Pjotr.


> Perl6 is getting interesting. The syntax is much cleaned up, proper
> OOP, and (what I like) strong functional programming support. But its
> execution speed is not even close to Ruby's now. I have heard people
> joke that Ruby is what Perl6 was meant to be.
>
> Anyway you can see where the Perl folks are heading.

Yes, Damian Conway of Perl Best Practices gave us a small workshop
recently, and I couldn't help thinking that Perl6 was an attempt to
rubify perl :)

> P.S. What is there to stop you from using both languages?

Nothing official. But I already find it difficult to keep the R, bash  
and ruby parts of my brain optimized without mixing in perl and  
others :)

Cheers,
yannick


From rob.syme at gmail.com  Fri Nov  6 02:55:45 2009
From: rob.syme at gmail.com (Rob Syme)
Date: Fri, 6 Nov 2009 10:55:45 +0800
Subject: [BioRuby] Parsing large blastout.xml files
Message-ID: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>

I'm trying to extract information from a large blast xml file. To parse the
xml file, ruby reads the whole file into memory before looking at each
entry. For large files (2.5GBish) - the memory requirements become severe.

My first approach was to split each query up into its own <BlastOutput> xml
instance, so that

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

Would end up looking more like:
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

Which bioruby has trouble parsing, so the <BlastOutput>s had to be given
their own file:

$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'

Now each file can be parsed individually. I feel like there has to be an
easier way. Is there a way to parse large xml files without huge memory
overheads, or is that just par for the course?


From rozziite at gmail.com  Fri Nov  6 03:11:32 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Thu, 5 Nov 2009 22:11:32 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>

Another option is to use the ruby-libxml reader
(http://libxml.rubyforge.org/rdoc/index.html). It reads the data
sequentially, so there is no memory overhead from first reading it all
into memory. However, you would then have to parse it from scratch.

On that note, maybe it is worth implementing Bio::Blast::Report.libxml
or something like that, the same way there are Bio::Blast::Report.rexml
and Bio::Blast::Report.xmlparser. A dependency on ruby-libxml was
introduced into the BioRuby library with the PhyloXML parser.

Diana


From adamnkraut at gmail.com  Fri Nov  6 03:17:02 2009
From: adamnkraut at gmail.com (Adam)
Date: Thu, 5 Nov 2009 22:17:02 -0500
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <134ede0b0911051917nf0877e0y8df95c3147a24d07@mail.gmail.com>

You might want to try a SAX Parser instead.

REXML from the standard library has a streaming API.  LibXML is a lot faster
and it's available as a gem.

http://libxml.rubyforge.org/
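
As a rough illustration of the REXML streaming API, the sketch below
counts <Hit> elements per <Iteration> without building a document
tree. The HitCounter class and count_hits helper are made up for this
example, and the XML is a toy version of the layout in the original
question. A real report would be passed in as File.open(path).

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: REXML calls tag_start for every opening tag.
class HitCounter
  include REXML::StreamListener
  attr_reader :hits_per_iteration

  def initialize
    @hits_per_iteration = []
  end

  def tag_start(name, _attrs)
    case name
    when 'Iteration'
      @hits_per_iteration << 0            # a new query starts
    when 'Hit'
      @hits_per_iteration[-1] += 1 unless @hits_per_iteration.empty?
    end
  end
end

# Hypothetical helper: io can be any IO object, e.g. File.open(path).
def count_hits(io)
  listener = HitCounter.new
  REXML::Document.parse_stream(io, listener)
  listener.hits_per_iteration
end

xml = <<XML
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration><Iteration_hits><Hit></Hit><Hit></Hit></Iteration_hits></Iteration>
    <Iteration><Iteration_hits><Hit></Hit></Iteration_hits></Iteration>
  </BlastOutput_iterations>
</BlastOutput>
XML

p count_hits(StringIO.new(xml))   # [2, 1]
```

The price is that the listener has to track where it is in the
document itself, which is the "parse it from scratch" cost mentioned
elsewhere in this thread.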


From pjotr.public14 at thebird.nl  Fri Nov  6 08:58:15 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 6 Nov 2009 09:58:15 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
	<4057d3bf0911051911v1b98a2c7ka17392afb6204e37@mail.gmail.com>
Message-ID: <20091106085815.GA12244@thebird.nl>

Diana is right. We need to revamp the implementation for big results.
Not only that, the current implementation has method names that do not
match the BLAST names. I need something like this pretty soon and
was thinking of writing it.

Pj.


From pjotr.public14 at thebird.nl  Sat Nov  7 07:42:44 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 7 Nov 2009 08:42:44 +0100
Subject: [BioRuby] Parsing large blastout.xml files
In-Reply-To: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
References: <f2863e380911051855n2f613747ofdb877023179294c@mail.gmail.com>
Message-ID: <20091107074244.GA22748@thebird.nl>

I did the same a while back using xmltwig:

  http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml


From djaunzei at smith.edu  Sun Nov  8 03:50:26 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Sat, 7 Nov 2009 22:50:26 -0500
Subject: [BioRuby] BioRuby Phyloxml update
Message-ID: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>

Hi all,

So finally I have updated Bio::Tree and Bio::Node classes to improve
the phyloxml writer speed.

* Added Bio::Node::parent and Bio::Node::children (array of nodes) in
order to avoid calling Tree::parent(node) or Tree::children(node),
because those methods run a breadth-first search on the underlying
graph, which makes the PhyloXML writer and parser incredibly slow. In
contrast, Bio::Node::parent and Bio::Node::children keep references
to the respective nodes.
* Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
track of Node::parent and Node::children nodes correctly.  Have I
forgotten anything?
* The PhyloXML writer now takes less than 1 second instead of
~20 minutes to write the ncbi_taxonomy_mollusca.xml file (1.5MB)
* Writing the tree of life taxonomy file (~46MB) takes 10 seconds
(on 2.4GHz, 2.9GB RAM, running Ubuntu)
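
The idea in the first two points can be sketched as follows. This is
an illustration only, not the code in the branch: SketchNode and
SketchTree are made-up classes, and the point is just that the edge
methods keep the cached parent/children references in step, so a
lookup is a plain attribute read instead of a graph search.

```ruby
# Hypothetical node that caches its own parent and children.
class SketchNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil
    @children = []
  end
end

# Hypothetical tree: every edge change also updates the cached references.
class SketchTree
  def add_edge(parent, child)
    child.parent = parent
    parent.children << child unless parent.children.include?(child)
  end

  def remove_edge(parent, child)
    parent.children.delete(child)
    child.parent = nil if child.parent.equal?(parent)
  end
end

tree = SketchTree.new
root = SketchNode.new('root')
leaf = SketchNode.new('leaf')
tree.add_edge(root, leaf)
puts leaf.parent.name             # root
puts root.children.map(&:name).inspect
tree.remove_edge(root, leaf)
puts leaf.parent.inspect          # nil
```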

The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class

I wrote unit tests for my changes and made sure my changes don't break
anything else. However, does anybody have code lying around that uses
the Tree::parent and Tree::children methods so that I can test it more
thoroughly?

Cheers,
Diana


From ngoto at gen-info.osaka-u.ac.jp  Sun Nov  8 12:50:56 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 08 Nov 2009 21:50:56 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
Message-ID: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>

Hi Diana,

I'm sorry that the changes cannot be accepted, because the
modification of existing Bio::Tree methods breaks things.
Bio::Tree does not want to have children/parent information
in nodes. One of the reasons is that it is difficult to keep
consistency when copying a tree. Nodes can be shared with two
or more trees when copying a tree by using "dup" or "clone"
method.
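
The consistency problem can be shown with a small sketch. The classes
below are made up and are not Bio::Tree. They only assume what is
described above: a copied tree shares its node objects with the
original, so anything stored inside a node is shared as well.

```ruby
# Hypothetical classes, not Bio::Tree: a node that stores its own parent,
# and a tree that is just a list of nodes.
LinkedNode = Struct.new(:name, :parent)

class NodeListTree
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Called by dup/clone: copy the list, but keep pointing at the same nodes.
  def initialize_copy(source)
    super
    @nodes = source.nodes.dup
  end
end

original = NodeListTree.new
root  = LinkedNode.new('root', nil)
child = LinkedNode.new('child', root)
original.nodes << root << child

copy = original.dup
copy.nodes[1].parent = nil              # detach 'child' in the copy only...

puts original.nodes[1].parent.inspect   # nil: the original changed as well
```

Keeping the relationships in the tree rather than in the nodes avoids
this, at the cost of a search on every lookup.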

Normally, tests for existing classes should not be modified
except when the specification changes or the test itself has a bug,
because they guarantee the specification of the class. Adding new
tests is OK.

If you really want nodes to have parent/children information,
please do so only in the PhyloXML classes (though I'm against it).
In this case, the problem is that reading phyloxml data and writing
it back again seems fine, but there currently seems to be no way to
convert Bio::Tree to PhyloXML. As it stands, it seems hard to convert
Newick data to PhyloXML.

Now, to prepare to include your PhyloXML code in BioRuby, I'm working
on my branch. Some API changes will be made.
http://github.com/ngoto/bioruby/tree/incoming

Note that in your test code, the argument order of assert_equal is
wrong. I've already fixed this in my branch.
http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94

> * Updated  Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly.  Have I
> forgotten anything?

Changing root with tree.root=().

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From jan.aerts at gmail.com  Mon Nov 16 10:11:24 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Mon, 16 Nov 2009 10:11:24 +0000
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>

All,

I think we should make a good effort of merging Diana's code into the
bioruby codebase. Even though I'm not completely familiar with
bioruby's phylo implementation, an effort like hers should be welcomed
with open arms.

If her code speeds things up so immensely, why don't we start a new
branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
With a major new release, things are allowed to break free from the
legacy code.

We definitely don't want Diana's efforts to be in vain.

jan.

2009/11/8 Naohisa Goto <ngoto at gen-info.osaka-u.ac.jp>:
> Hi Diana,
>
> I'm sorry that the changes cannot be accepted, because the
> modification of existing Bio::Tree methods breaks things.
> Bio::Tree does not want to have children/parent information
> in nodes. One of the reasons is that it is difficult to keep
> consistency when copying a tree. Nodes can be shared with two
> or more trees when copying a tree by using "dup" or "clone"
> method.
>
> Normally, tests for existing classes shold not be modified
> except when changing specification or the test's bug, because
> they guarantee specification of the class. Adding new tests
> are OK.
>
> If you really want nodes to have parent/children information
> in each node, please do so in only PhyloXML classes (though
> I'm negative). In this case, the problem is that reading phyloxml
> data and write back again seems good, but it seems there are
> currently no way to convert Bio::Tree to PhyloXML. Now, it seems
> hard to convert Newick data to PhyloXML.
>
> Now, to prepare to include your PhyloXML code in BioRuby, I'm working
> on my branch. Some API changes will be made.
> http://github.com/ngoto/bioruby/tree/incoming
>
> Note that in your test code, argument order of assert_equal is wrong.
> I've already fixed in my branch.
> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
>
>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>> track of Node::parent and Node::children nodes correctly. Have I
>> forgotten anything?
>
> Changing root with tree.root=().
>
> --
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>
>> Hi all,
>>
>> So finally I have updated Bio::Tree and Bio::Node classes to improve
>> the phyloxml writer speed.
>>
>> * Added Bio::Node::parent and Bio::Node::children (array of nodes) in
>> order to avoid calling Tree::parent(node) or Tree::children(node),
>> because those methods call breath first search on the underlying
>> graph, which makes PhyloXML writer and parser incredibly slow. In
>> contrast, Bio::Node::parent and Bio::Node::children keeps references
>> to the respective nodes.
>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>> track of Node::parent and Node::children nodes correctly. Have I
>> forgotten anything?
>> * Now for PhyloXML writer it takes less than 1 second instead of
>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
>> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
>>
>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>
>> I wrote unit tests for my changes and made sure my changes don't break
>> anything else. However, does anybody has code laying around that uses
>> Tree::parent and Tree::children methods so that I can test it more
>> thoroughly?
>>
>> Cheers,
>> Diana


From georgkam at gmail.com  Tue Nov 17 05:40:31 2009
From: georgkam at gmail.com (George Githinji)
Date: Tue, 17 Nov 2009 08:40:31 +0300
Subject: [BioRuby] BioRuby Digest, Vol 50, Issue 6
In-Reply-To: <mailman.15.1258390804.12843.bioruby@lists.open-bio.org>
References: <mailman.15.1258390804.12843.bioruby@lists.open-bio.org>
Message-ID: <55915f820911162140w592077f4o448d63e11b4300be@mail.gmail.com>

If Ruby itself is known to be slow compared to other interpreters, and
Diana's code speeds things up, then as a BioRuby user I would plead with the
developers to adopt her code, with its speed optimizations, in the next
release. The next release can only be better if the current code base
is overhauled and reviewed in light of new developments like Diana's.

If Newick can be converted to a format which can then be converted to
PhyloXML, then Newick conversion is not a problem. Otherwise I would question
the use of the Newick format if it cannot be converted to other file
formats.



-- 
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/


From djaunzei at smith.edu  Tue Nov 17 14:52:59 2009
From: djaunzei at smith.edu (Diana Jaunzeikare)
Date: Tue, 17 Nov 2009 09:52:59 -0500
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com> 
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
Message-ID: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>

Thanks for the discussion. I see Naohisa's point that it is difficult to
keep consistency when copying a tree.
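
To make the copying problem concrete, here is a minimal sketch in plain
Ruby. SketchNode and SketchTree are hypothetical classes invented for this
illustration, not Bio::Tree; the only assumption is that "dup" copies the
container but not the node objects inside it.

```ruby
# Hypothetical minimal classes, not Bio::Tree itself.
class SketchNode
  attr_accessor :name, :parent

  def initialize(name)
    @name = name
    @parent = nil
  end
end

class SketchTree
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Called by dup/clone: copy the array, but not the nodes inside it.
  def initialize_copy(source)
    super
    @nodes = source.nodes.dup
  end

  # Record the parent inside the child node itself.
  def add_child(parent, child)
    child.parent = parent
    @nodes << parent unless @nodes.include?(parent)
    @nodes << child unless @nodes.include?(child)
  end
end

original = SketchTree.new
root = SketchNode.new("root")
leaf = SketchNode.new("leaf")
original.add_child(root, leaf)

copy = original.dup
new_root = SketchNode.new("new_root")
copy.add_child(new_root, leaf)   # re-parent the leaf in the copy only

# Both trees hold the very same leaf object, so the original tree now
# reports the parent that was set through the copy.
same_object = original.nodes[1].equal?(copy.nodes[1])
parent_in_original = original.nodes[1].parent.name
```

Because the parent pointer lives in the shared node, it cannot be correct
for both trees at once unless nodes are deep-copied along with the tree.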

Right now the PhyloXML class inherits from the Bio::Tree class. Instead, I
could write a new general Bio::FamilyTree class (per Pjotr's
suggestion), which would be strictly a tree (I believe that Bio::Tree
allows a node to have 2 parents) and would have parent/child
information. It would therefore not need an underlying general graph
implementation, which would make it simpler than
Bio::Tree. PhyloXML::Tree would then inherit from Bio::FamilyTree.
This way the PhyloXML writer would probably be even faster, because it
would not need to update the Bio::Pathway structure (which underlies
Bio::Tree) every time a node or edge is added.
Additionally, I think BioRuby would benefit from a general
Bio::FamilyTree class. I recently heard a talk by a researcher who did
a phylogenetic analysis of musical rhythms.
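
A rough sketch of what such a strictly rooted node could look like is
below. The class name RootedNode and its methods are assumptions made for
illustration only; nothing here is existing BioRuby API.

```ruby
# Hypothetical sketch of a strictly rooted tree node; the class name
# and methods are illustrative, not part of BioRuby.
class RootedNode
  attr_reader :name, :parent, :children

  def initialize(name)
    @name = name
    @parent = nil
    @children = []
  end

  # Attach child under self, detaching it from any previous parent so
  # that every node has at most one parent.
  def add_child(child)
    child.parent.children.delete(child) if child.parent
    child.set_parent(self)
    @children << child
    child
  end

  # Walk the subtree depth-first by following stored references,
  # without any search over a general graph.
  def each_descendant(&block)
    @children.each do |c|
      block.call(c)
      c.each_descendant(&block)
    end
  end

  protected

  def set_parent(node)
    @parent = node
  end
end

root = RootedNode.new("Mollusca")
gastropoda = root.add_child(RootedNode.new("Gastropoda"))
bivalvia = root.add_child(RootedNode.new("Bivalvia"))
helix = gastropoda.add_child(RootedNode.new("Helix"))

names = []
root.each_descendant { |n| names << n.name }
```

Parent and child lookups are then plain attribute reads, which is where
the speed-up for a writer would be expected to come from.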

Also I will write a method to convert from Newick to PhyloXML.

What do you think?

Cheers,
Diana

On Mon, Nov 16, 2009 at 5:11 AM, Jan Aerts <jan.aerts at gmail.com> wrote:
> All,
>
> I think we should make a good effort of merging Diana's code into the
> bioruby codebase. Even though I'm not completely familiar with
> bioruby's phylo implementation, an effort like hers should be welcomed
> with open arms.
>
> If her code speeds things up so immensely, why don't we start a new
> branch that will lead to bioruby 2.0? Let bioruby 2.0 break things.
> With a major new release things are allowed to be broken free from the
> legacy code.
>
> We definitely don't want Diana's efforts be in vain.
>
> jan.
>
> 2009/11/8 Naohisa Goto <ngoto at gen-info.osaka-u.ac.jp>:
>> Hi Diana,
>>
>> I'm sorry that the changes cannot be accepted, because the
>> modification of existing Bio::Tree methods breaks things.
>> Bio::Tree does not want to have children/parent information
>> in nodes. One of the reasons is that it is difficult to keep
>> consistency when copying a tree. Nodes can be shared with two
>> or more trees when copying a tree by using "dup" or "clone"
>> method.
>>
>> Normally, tests for existing classes shold not be modified
>> except when changing specification or the test's bug, because
>> they guarantee specification of the class. Adding new tests
>> are OK.
>>
>> If you really want nodes to have parent/children information
>> in each node, please do so in only PhyloXML classes (though
>> I'm negative). In this case, the problem is that reading phyloxml
>> data and write back again seems good, but it seems there are
>> currently no way to convert Bio::Tree to PhyloXML. Now, it seems
>> hard to convert Newick data to PhyloXML.
>>
>> Now, to prepare to include your PhyloXML code in BioRuby, I'm working
>> on my branch. Some API changes will be made.
>> http://github.com/ngoto/bioruby/tree/incoming
>>
>> Note that in your test code, argument order of assert_equal is wrong.
>> I've already fixed in my branch.
>> http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
>>
>>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly. Have I
>>> forgotten anything?
>>
>> Changing root with tree.root=().
>>
>> --
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>>
>>
>>> Hi all,
>>>
>>> So finally I have updated Bio::Tree and Bio::Node classes to improve
>>> the phyloxml writer speed.
>>>
>>> * Added Bio::Node::parent and Bio::Node::children (array of nodes) in
>>> order to avoid calling Tree::parent(node) or Tree::children(node),
>>> because those methods call breath first search on the underlying
>>> graph, which makes PhyloXML writer and parser incredibly slow. In
>>> contrast, Bio::Node::parent and Bio::Node::children keeps references
>>> to the respective nodes.
>>> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
>>> Tree::remove_edge_if, Tree:remove_nonsense_nodes in order to keep
>>> track of Node::parent and Node::children nodes correctly. Have I
>>> forgotten anything?
>>> * Now for PhyloXML writer it takes less than 1 second instead of
>>> ~20minutes to write ncbi_taxonomy_mollusca.xml file 1.5MB
>>> * To write the tree of life taxonomy file (~46MB) it takes 10 seconds
>>> (On 2.4GHz, 2.9GB RAM, running Ubuntu)
>>>
>>> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
>>>
>>> I wrote unit tests for my changes and made sure my changes don't break
>>> anything else. However, does anybody has code laying around that uses
>>> Tree::parent and Tree::children methods so that I can test it more
>>> thoroughly?
>>>
>>> Cheers,
>>> Diana


From ngoto at gen-info.osaka-u.ac.jp  Tue Nov 17 16:27:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 18 Nov 2009 01:27:46 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
Message-ID: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I've just committed a speed-up of Bio::Tree#children in my repository.
It keeps compatibility. As a trade-off for the speed-up, memory consumption
is a little larger than with the previous code.
http://github.com/ngoto/bioruby
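
One general way to get this kind of trade-off (more memory for faster
lookups) is to cache child lists and rebuild them only after the topology
changes. The sketch below illustrates that idea in plain Ruby; it is not
the code committed in the repository above, and CachedTree is a
hypothetical class.

```ruby
# Hypothetical sketch: an undirected adjacency list plus a cache of
# child lists, rebuilt lazily after any edge change.
class CachedTree
  def initialize(root)
    @root = root
    @adjacency = Hash.new { |h, k| h[k] = [] }
    @children_cache = nil
  end

  def add_edge(a, b)
    @adjacency[a] << b
    @adjacency[b] << a
    @children_cache = nil   # invalidate: the topology changed
  end

  def children(node)
    build_cache unless @children_cache
    @children_cache.fetch(node, [])
  end

  private

  # One breadth-first pass from the root records every node's children.
  def build_cache
    cache = {}
    visited = { @root => true }
    queue = [@root]
    until queue.empty?
      current = queue.shift
      kids = @adjacency.fetch(current, []).reject { |n| visited[n] }
      kids.each { |n| visited[n] = true }
      cache[current] = kids
      queue.concat(kids)
    end
    @children_cache = cache
  end
end

tree = CachedTree.new(:root)
tree.add_edge(:root, :a)
tree.add_edge(:root, :b)
tree.add_edge(:a, :c)
root_children = tree.children(:root)
```

The cache costs one extra hash of arrays, and repeated children calls
between edits no longer trigger a new search.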

For benchmarking the reading and writing of big PhyloXML data, a new
sample script based on Diana's test_phyloxml_big.rb has been added
as sample/test_phyloxml_big.rb.

Running the new sample/test_phyloxml_big.rb on a machine
(Pentium D 3.40GHz, memory 4GB, running Debian GNU/Linux)
with http://github.com/ngoto/bioruby:
47.52user 0.93system 0:50.09elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+167550minor)pagefaults 0swaps

with http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
43.55user 1.00system 0:46.59elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+141424outputs (0major+165151minor)pagefaults 0swaps

Although my new code is still ~10% slower than Diana's new code,
I think that is acceptable because my code keeps compatibility.

I wrote Bio::Tree because I want to manipulate trees flexibly,
e.g. merging and splitting trees, or changing the root of a tree.
For that purpose, I chose not to store parent/children
in a node.

I also think the current Bio::Tree is not the best. One of its
weak points is that it is relatively heavy. The flexibility may
not be needed for parsers that only represent a fixed data structure.
A new class seems attractive for usages that cannot be covered by
the current Bio::Tree implementation.

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Tue, 17 Nov 2009 09:52:59 -0500
Diana Jaunzeikare <djaunzei at smith.edu> wrote:

> Thanks for discussion. I see Naohisa's point that it is difficult to
> keep consistency when copying a tree.
> 
> Right now PhyloXML class inherits from Bio::Tree class. Instead, I
> could write a new general Bio::FamilyTree class (per Pjotr's
> suggestion), which would be strictly a tree (I believe that Bio::Tree
> allows for a node to have 2 parents) and would have parent/child
> information. Thus it would not need underlying general graph
> implementation, therefore making the implementation simpler than that
> of Bio::Tree. Then PhyloXML::Tree would inherit from Bio::FamilyTree.
> This way PhyloXML writer probably would be even faster because it
> would not need to update Bio::Pathway structure (which is under
> Bio::Tree) every time adding a node or edge.
> Additionally, I think BioRuby would benefit from general
> Bio::FamilyTree class. I recently heard a talk by researcher who did
> phylogenetic analysis of musical rhythms.
> 
> Also I will write method to convert from newick to PhyloXML.
> 
> What do you think?
> 
> Cheers,
> Diana


From tomoakin at kenroku.kanazawa-u.ac.jp  Wed Nov 18 00:24:34 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Wed, 18 Nov 2009 09:24:34 +0900
Subject: [BioRuby] BioRuby Phyloxml update
In-Reply-To: <20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>
	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>
	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>
	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>
	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <A11D33B2-3F03-4278-86DA-F7428356845E@kenroku.kanazawa-u.ac.jp>

Hi,

One point seems to be that a tree can be unrooted or rooted.
Perhaps Goto-san's Bio::Tree represents an unrooted tree (not
distinguishing parents and children), while Diana's class is for
rooted trees (distinguishing parents and children).
If this is the point, Bio::RootedTree is a better name than
Bio::FamilyTree.
In general, a rooted tree should be easy to convert to an unrooted tree,
while converting an unrooted tree to a rooted tree requires specifying
the root.

For a text representation like NEWICK there is a root anyway, while
the tree can be interpreted as either rooted or unrooted.

It could be good to have distinct interfaces for rooted and unrooted
trees, to make users aware of the conceptual difference.
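
The asymmetry between the two conversions can be sketched in a few lines
of plain Ruby. The representations used here (a parent => children hash
for the rooted tree, a set of sorted pairs for the unrooted one) and the
method names are assumptions chosen for the example, not BioRuby
structures.

```ruby
require "set"

# Rooted -> unrooted: simply forget which end of each edge is the parent.
def unrooted_edges(children_of)
  edges = Set.new
  children_of.each do |parent, kids|
    kids.each { |kid| edges << [parent, kid].sort }
  end
  edges
end

# Unrooted -> rooted: the caller has to name the root explicitly.
def root_at(edges, root)
  neighbours = Hash.new { |h, k| h[k] = [] }
  edges.each do |a, b|
    neighbours[a] << b
    neighbours[b] << a
  end
  children_of = {}
  visit = lambda do |node, from|
    kids = neighbours[node].reject { |n| n == from }
    children_of[node] = kids.sort
    kids.each { |kid| visit.call(kid, node) }
  end
  visit.call(root, nil)
  children_of
end

rooted = { "r" => ["a", "b"], "a" => ["c", "d"],
           "b" => [], "c" => [], "d" => [] }
edges = unrooted_edges(rooted)
rerooted = root_at(edges, "a")   # same edges, different root
```

The first direction needs no extra input, while the second produces a
different parent/child structure for every choice of root.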
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


From tomoakin at kenroku.kanazawa-u.ac.jp  Thu Nov 19 00:33:32 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Thu, 19 Nov 2009 09:33:32 +0900
Subject: [BioRuby] Blast to Phylogeny
In-Reply-To: <4B045622.8040204@broadinstitute.org>
References: <4057d3bf0911071950x371798cpfa0c2dd778dedf98@mail.gmail.com>	<20091108215045.08CF.EEF6E030@gen-info.osaka-u.ac.jp>	<4c7507a70911160211v417fd748u4bf2dd5f2f14c4c9@mail.gmail.com>	<4057d3bf0911170652o3683768ax366df83afcfa48e0@mail.gmail.com>	<20091117162747.C1CD81CBC41B@idnmail.gen-info.osaka-u.ac.jp>
	<A11D33B2-3F03-4278-86DA-F7428356845E@kenroku.kanazawa-u.ac.jp>
	<4B045622.8040204@broadinstitute.org>
Message-ID: <EE017307-BE83-456E-B08D-46CBC4CFDF66@kenroku.kanazawa-u.ac.jp>

Hi,

In general, to construct a phylogenetic tree from molecular sequence
data, you collect the homologous sequences, perform a multiple alignment,
identify the region that will be used for the reconstruction,
and then pass the data to an appropriate program to reconstruct the
phylogeny.

If I had BLAST output, I would parse the file with Bio::FlatFile and
extract the identifiers of the hit sequences, use the identifiers to
collect the individual sequences, and submit the sequences to mafft for
multiple alignment. I would then convert the alignment to nexus format,
check it manually with MacClade, and parse the edited nexus file to write
the multiple alignment in a form readable by the phylogenetic
analysis program. There are many options you can take at each step.
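
As an illustration of the first step only (extracting hit identifiers),
here is a sketch in plain Ruby. It swaps in a simpler technique than the
Bio::FlatFile parsing described above: it assumes the BLAST run was saved
in tabular format (blastall -m 8), where column 2 is the subject
identifier and column 11 the E-value. The method name, the cutoff, and the
sample rows are made up for the example.

```ruby
# Collect unique subject (hit) identifiers from tabular BLAST output,
# keeping only hits at or below an E-value cutoff.
def hit_ids(tabular_text, max_evalue = 1e-5)
  ids = []
  tabular_text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty?
    fields = line.chomp.split("\t")
    subject_id = fields[1]
    evalue = fields[10].to_f
    ids << subject_id if evalue <= max_evalue
  end
  ids.uniq
end

# Made-up rows in the 12-column tabular layout, for illustration only.
sample = [
  %w[query1 hitA 98.5 200 3 0 1 200 5 204 1e-100 380],
  %w[query1 hitB 75.0 180 45 2 10 189 20 199 2e-20 95.1],
  %w[query1 hitA 60.0 50 20 0 150 199 300 349 0.5 30.2],
  %w[query1 hitC 55.0 40 18 1 1 40 1 40 3.2 25.0]
].map { |row| row.join("\t") }.join("\n") + "\n"

ids = hit_ids(sample)
```

The resulting identifiers would then be used to fetch the sequences that
go into the multiple alignment.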

So, there are multiple ways, but not a single simple way. :(

BioRuby has support for multiple alignment programs like mafft,
muscle, and clustalw.
For phylogenetic reconstruction, there is some support for phylip and
paml (I have not tried these features of the BioRuby library, though).
There are a number of programs for phylogenetic analysis other than
phylip and paml.
A list compiled by J. Felsenstein is available at
http://evolution.genetics.washington.edu/phylip/software.html

An alignment in a format similar to that of phylip will be accepted by most
programs.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan


On 2009/11/19, at 5:16, Sharvari Gujja wrote:

> Hi,
>
> I am trying to construct a phylogenetic tree from Blast  
> output...Could you please let me know if there is a way to do  
> this..I have also been looking at Bio::Tree documentation but it is  
> not clear if it accepts Blast file as input.
>
> Appreciate any help.
>
> Thanks
> Sharvari


From robert.citek at gmail.com  Thu Nov 19 20:06:22 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Thu, 19 Nov 2009 15:06:22 -0500
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <4145b6790911191206r53c86818m280e3a149f9293ec@mail.gmail.com>

Hello all,

I would like to create a custom BLAST scoring matrix that I can use
with NCBI's blastall.  For example, let's say I want to create a
modified BLOSUM62 matrix called BLOSUM62ar, where the A:R score is now
2 instead of -1.
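
Just to pin down what I mean by the modification, here is a sketch in
plain Ruby that edits the text of a matrix file. It assumes the usual
NCBI layout ("#" comment lines, a header row of residue letters, then one
row per residue); set_pair_score is a hypothetical helper, and the
three-residue excerpt stands in for the full BLOSUM62 file. It says
nothing about whether blastall will accept the resulting file.

```ruby
# Return a copy of an NCBI-style matrix text with the score for the
# pair (a, b) replaced symmetrically. Comment lines are kept as they are.
def set_pair_score(matrix_text, a, b, score)
  columns = nil
  matrix_text.lines.map do |line|
    stripped = line.strip
    if stripped.empty? || stripped.start_with?("#")
      line
    elsif columns.nil?
      columns = stripped.split   # header row: residue letters
      line
    else
      row, *scores = stripped.split
      scores[columns.index(b)] = score.to_s if row == a
      scores[columns.index(a)] = score.to_s if row == b
      ([row] + scores.map { |s| s.rjust(2) }).join(" ") + "\n"
    end
  end.join
end

excerpt = <<~MATRIX
  # Excerpt of BLOSUM62 (three residues only, for illustration)
     A  R  N
  A  4 -1 -2
  R -1  5  0
  N -2  0  6
MATRIX

modified = set_pair_score(excerpt, "A", "R", 2)
```

The change is applied to both the A row and the R row so that the matrix
stays symmetric.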

Some questions that I have:

1) is this possible?
2) if it is, where can I find documentation which describes how to do this?
3) is the blast output different from a regular blast?
4) if it is different, does bio-ruby have blast parsers that can parse
the output?

Thanks in advance for any pointers and suggestions.

Regards,
- Robert


From georgkam at gmail.com  Sat Nov 21 08:58:53 2009
From: georgkam at gmail.com (George Githinji)
Date: Sat, 21 Nov 2009 11:58:53 +0300
Subject: [BioRuby] custom blast scoring matrix
Message-ID: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>

Hi Martin,
Thanks for bringing the topic to the list. Some time back I was also very
interested in custom matrices for NCBI BLAST.
Making custom matrices is possible. Check this out:
BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

However, making your matrices work with NCBI BLAST is slightly difficult, as
you need to recompile the BLAST program and incorporate your modifications. I
found this not so straightforward, for lack of good documentation.

I wonder whether anyone has implemented the BLAST algorithm in
Ruby. (The argument is usually that the C implementation is very optimized
and good, so why would one want to implement it in Ruby? I would not
buy that argument for learning purposes, though.) The closest I came to a BLAST
algorithm is an implementation of it in Perl, in the book Genomic Perl by
Rex A. Dwyer. He also outlines how to create your own matrices, with code
listings in Perl.

Please ping me back if you get more resources. :)
George



-- 
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/


From robert.citek at gmail.com  Sun Nov 22 13:55:58 2009
From: robert.citek at gmail.com (Robert Citek)
Date: Sun, 22 Nov 2009 08:55:58 -0500
Subject: [BioRuby] custom blast scoring matrix
In-Reply-To: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
References: <55915f820911210058g6e089f76wc5c9cbfdad3e9966@mail.gmail.com>
Message-ID: <4145b6790911220555q410187fak9f8b1b66e4a0ddf2@mail.gmail.com>

On Sat, Nov 21, 2009 at 3:58 AM, George Githinji <georgkam at gmail.com> wrote:
> Thanks for bringing the topic on list. Sometimes back i was also very
> interested in custom matrices for NCBI blast.
> Making custom Matrices is possible. check this out
> BMC Bioinformatics 2008, 9:236 doi:10.1186/1471-2105-9-236

Thanks for the citation.  I'll have a look into that.

> However making your matrices work with NCBI blast is slightly difficult as
> you need to recompile the BLAST program and incoporate your modifications. I
> found this a little bit not so straighforward. Lack of good documentation.

That's unfortunate.  I've tried compiling NCBI blast a few times in
the past and don't ever recall having success with it, running into
the same issues you describe.  But it's been a while and maybe the
process has become easier.  I'll give it a whirl.

> I wonder whether there is someone who has implemented the BLAST algorithm in
> Ruby. (The argument is usually that the C implementation is very optimized
> and good, so why would one want to implement it in ruby?) though i would not
> buy that argument for learning purposes. The closest i came to a BLAST
> algorithm is an implementation of it in Perl, in the book Genomic Perl by
> Rex A. Dwyer, He also outlines how to create your own matrices with code
> listings in perl.

Thanks.  I'll have a look at that as well.

> Please ping me back if you get more resources. :)

Will do.

Regards,
- Robert


From pjotr.public14 at thebird.nl  Thu Nov 26 13:08:30 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:08:30 +0100
Subject: [BioRuby] Ruby EMBOSS mapping (using Biolib)
Message-ID: <20091126130830.GA19003@thebird.nl>

Hi all,

Over the last year I have been working on C library mappings to Ruby. A
comparison of BioRuby against Biolib/EMBOSS for six-frame translation of a
C. elegans dataset shows that the Ruby-with-EMBOSS version is about 30x
faster. On my (outdated) machine:

Bioruby version:

  22929 records 137574 times translated!
   real    9m30.952s
   user    8m42.877s
   sys     0m32.878s

Biolib version:

  22929 records 137574 times translated!
   real    0m20.306s
   user    0m15.997s
   sys     0m1.344s

This includes IO, which is handled by Ruby.

The Bioruby code reads:

  nt = FastaReader.new(fn)
  nt.each { | rec |
      seq = Bio::Sequence::NA.new(rec.seq)
      [-3,-2,-1,1,2,3].each do | frame |
        print "> ",rec.id," ",frame.to_s,"\n"
        print seq.translate(frame),"\n"
      end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

The Biolib code reads

  nt = FastaReader.new(fn)
  trnTable = Biolib::Emboss.ajTrnNewI(1);
  nt.each { | rec |
      ajpseq   = Biolib::Emboss.ajSeqNewNameC(rec.seq,"Test sequence")
      [-3,-2,-1,1,2,3].each do | frame |
        ajpseqt  = Biolib::Emboss.ajTrnSeqOrig(trnTable,ajpseq,frame)
        aa       = Biolib::Emboss.ajSeqGetSeqCopyC(ajpseqt)
        print "> ",rec.id," ",frame.to_s,"\n"
        print aa,"\n"
      end
  }
  $stderr.print nt.size," records ",nt.size*6*iter," times translated!"

A write up of the mapping effort is at:

  http://biolib.open-bio.org/wiki/Mapping_EMBOSS


From pjotr.public14 at thebird.nl  Thu Nov 26 13:44:27 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 14:44:27 +0100
Subject: [BioRuby] Announcing BigBio project for Ruby
Message-ID: <20091126134427.GA20660@thebird.nl>

BigBio = BIG DATA computing (for Ruby)

BigBio is an initiative to create high-performance libraries for big data
computing in biology, initially for the Ruby language.

The Ruby version of BigBio uses BioRuby, when sensible, but provides an
interface with a different design. Also, unlike BioRuby, which aims to be pure
Ruby, it uses BioLib C/C++ functions for increased performance and reduced
memory consumption.

The first module is an (indexed) FastaReader which does not load the
full FASTA file in memory. 
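
The general idea behind an indexed reader can be sketched in plain Ruby as
follows. This is not the BigBio implementation; IndexedFasta and its
methods are hypothetical names, and the sketch assumes record identifiers
are the first word after ">" and are unique within the file.

```ruby
require "tempfile"

# Hypothetical sketch of an indexed FASTA reader; not BigBio's code.
class IndexedFasta
  def initialize(path)
    @path = path
    @offsets = {}   # record id => byte offset of its ">" line
    File.open(path) do |f|
      until f.eof?
        pos = f.pos
        line = f.gets
        @offsets[line[1..-1].split.first] = pos if line.start_with?(">")
      end
    end
  end

  def ids
    @offsets.keys
  end

  # Seek straight to one record and read only its lines.
  def fetch(id)
    offset = @offsets.fetch(id)
    seq = String.new
    File.open(@path) do |f|
      f.seek(offset)
      f.gets   # skip the header line
      while (line = f.gets) && !line.start_with?(">")
        seq << line.strip
      end
    end
    seq
  end
end

# Tiny made-up file, for illustration only.
fasta = Tempfile.new("sketch")
fasta.write(">seq1 first record\nACGT\nACGT\n>seq2\nTTGA\n")
fasta.close

reader = IndexedFasta.new(fasta.path)
second = reader.fetch("seq2")
```

Only the index (one offset per record) stays in memory; sequence data is
read from disk when a record is requested.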

http://github.com/pjotrp/bigbio

Pj.


From jan.aerts at gmail.com  Thu Nov 26 13:44:58 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:44:58 +0000
Subject: [BioRuby] VCF
Message-ID: <4c7507a70911260544j4ba5f089y38c76d4f48131258@mail.gmail.com>

Is anyone working on a VCF (Variant Call Format) parser in bioruby?
http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2


From jan.aerts at gmail.com  Thu Nov 26 13:46:52 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:46:52 +0000
Subject: [BioRuby] Announcing BigBio project for Ruby
In-Reply-To: <20091126134427.GA20660@thebird.nl>
References: <20091126134427.GA20660@thebird.nl>
Message-ID: <4c7507a70911260546w45839e7fra4a2565a66bc47ff@mail.gmail.com>

Interesting... Planning to incorporate SAM/BAM alignment formats for
nextgen sequences?

jan.

2009/11/26 Pjotr Prins <pjotr.public14 at thebird.nl>:
> BigBio = BIG DATA computing (for Ruby)
>
> BigBio is an initiative to create high-performance libraries for big data
> computing in biology - initially for the Ruby language.
>
> The Ruby version of BigBio uses BioRuby, when sensible, but provides an
> interface with a different design. Also, unlike BioRuby which aims to be pure
> Ruby, it uses BioLib C/C++ functions for increased performance and reduced
> memory consumption.
>
> The first module is an (indexed) FastaReader which does not load the
> full FASTA file in memory.
>
> http://github.com/pjotrp/bigbio
>
> Pj.


From jan.aerts at gmail.com  Thu Nov 26 13:52:16 2009
From: jan.aerts at gmail.com (Jan Aerts)
Date: Thu, 26 Nov 2009 13:52:16 +0000
Subject: [BioRuby] Bio::DB::Sam
Message-ID: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>

And another parser that probably should be added to bioruby: something
to interact with SAM/BAM files (which contain mapping positions for
short reads). More info at samtools.sourceforge.net

Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
go for something similar?
http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Is anyone already working on this?
jan.


From pjotr.public14 at thebird.nl  Thu Nov 26 14:17:03 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 26 Nov 2009 15:17:03 +0100
Subject: [BioRuby] Bio::DB::Sam
In-Reply-To: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
References: <4c7507a70911260552u5ff78223r4905a27850e20179@mail.gmail.com>
Message-ID: <20091126141703.GA21032@thebird.nl>

On Thu, Nov 26, 2009 at 01:52:16PM +0000, Jan Aerts wrote:
> And another parser that probably should be added to bioruby: something
> to interact with SAM/BAM files (which contain mapping positions for
> short reads). More info at samtools.sourceforge.net

By the looks of it, this should be relatively easy with SWIG, and
therefore with Biolib.

> Lincoln has written a nice API for perl: Bio::DB::Sam. Maybe we should
> go for something similar?
> http://search.cpan.org/~lds/Bio-SamTools-1.07/lib/Bio/DB/Sam.pm

Wow, this guy is hard core! Doing this with PerlXS takes a *lot* of
effort. XS is sooooo nineties ;-)

> Is anyone already working on this?

I am happy to write a SWIG mapper, provided someone really cares to use
it and will write the higher-level Ruby interface (a nice OOP class
representation).

I have been told BioRuby is pure Ruby, so this would not fit in.

Pj.


From biopython at maubp.freeserve.co.uk  Thu Nov 26 16:02:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Nov 2009 16:02:50 +0000
Subject: [BioRuby] Fwd: [DAS] DAS workshop 7th-9th April 2010
In-Reply-To: <F30A9ED7-41E9-4833-A094-FDF0893E0F92@sanger.ac.uk>
References: <F30A9ED7-41E9-4833-A094-FDF0893E0F92@sanger.ac.uk>
Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com>

This might be of interest to some of you.

Peter

---------- Forwarded message ----------
From: Jonathan Warren <jw12 at sanger.ac.uk>
Date: Thu, Nov 26, 2009 at 2:57 PM
Subject: [DAS] DAS workshop 7th-9th April 2010
To: das at biodas.org, das_registry_announce at sanger.ac.uk, biojava-dev
<biojava-dev at biojava.org>, BioJava <biojava-l at biojava.org>, BioPerl
<bioperl-l at lists.open-bio.org>, all at sanger.ac.uk, all at ebi.ac.uk,
ensembldev <ensembl-dev at ebi.ac.uk>


We are considering running a Distributed Annotation System workshop
here at the Sanger/EBI in the UK subject to decent demand.
The workshop will be held from Wednesday 7th to Friday 9th April 2010. If
you would be interested in attending, either to present or just to take
part, then please email me at jw12 at sanger.ac.uk

The format of the workshop is likely to be similar to last year's (1st
day for beginners, 2nd for both beginners and advanced users, 3rd day
for advanced), information for which can be found here:
http://www.dasregistry.org/course.jsp

If you would like to present then please send a short summary of what
you would like to talk about.

Thanks

Jonathan.

Jonathan Warren
Senior Developer and DAS coordinator
jw12 at sanger.ac.uk


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
DAS mailing list
DAS at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das


From josejotero at gmail.com  Sat Nov 28 02:55:38 2009
From: josejotero at gmail.com (Jose Otero)
Date: Fri, 27 Nov 2009 18:55:38 -0800
Subject: [BioRuby] Bio::GenBank
Message-ID: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>

Hello all,
I'm new to BioRuby and I am trying to adapt the Bio::GenBank class to store
information for my plasmid database.
Question 1:  Does anybody know how to insert a nucleic acid sequence as the
value of 'sequence' in the @data object?
Adding information as Bio::Feature::Qualifier objects is easy, as is
inserting Bio::Locus information.  But I can't figure out how to insert the
sequence data.
Question 2:  Has anybody ever changed the data in a Bio::GenBank object and
saved the altered file?  This would be very interesting for my plasmid
database.

Thanks for the help.
JO


From ngoto at gen-info.osaka-u.ac.jp  Sat Nov 28 09:00:01 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 28 Nov 2009 18:00:01 +0900
Subject: [BioRuby] Bio::GenBank
In-Reply-To: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>
References: <baa601bb0911271855k2845010buf33e14fb68def8c3@mail.gmail.com>
Message-ID: <20091128090002.372041CBC49E@idnmail.gen-info.osaka-u.ac.jp>

Hello Jose,

On Fri, 27 Nov 2009 18:55:38 -0800
Jose Otero <josejotero at gmail.com> wrote:

> Hello all,
> I'm new to BioRuby and I am trying to adapt the Bio::GenBank class to store
> information for my plasmid database.
> Question 1:  Does anybody know how to insert a nucleic acid sequence as the
> value of 'sequence' in the @data object?
> Adding information as Bio::Feature::Qualifier objects is easy, as is
> inserting Bio::Locus information.  But I can't figure out how to insert the
> sequence data.

Once an object of the Bio::GenBank class is created, the data stored
in the object are intended to be read-only, though modification is not
explicitly prohibited. This is because the class is designed for
efficient parsing of GenBank-formatted text, and it is technically
not easy to achieve both efficient parsing and flexible modification.
(This also applies to most parser classes, e.g. Bio::EMBL, Bio::SPTR,
etc.)

In your case, using Bio::Sequence seems the best way. Once a
Bio::GenBank object has been converted to a Bio::Sequence object, the
data can be freely modified.

  # Assume str contains GenBank formatted text as String.
  #
  # Creating a new Bio::GenBank object.
  gb = Bio::GenBank.new(str)

  # Converting to Bio::Sequence object
  s = gb.to_biosequence

  # Modifying the sequence.
  #
  # Note that other attributes, such as features and references
  # (which depend on locations on the sequence), are kept unchanged.
  # Relocating the features, references, etc. is left to the
  # user.
  # 
  s.seq = 'atgc' * 10 + s.seq

  # Text formatting as the GenBank format.
  puts s.output(:genbank)

It is also possible to create a new Bio::Sequence object from scratch,
set its definition, accessions, keywords, references, features, etc.,
and then get GenBank-formatted text from it.

> Question 2:  Has anybody ever changed the data in a Bio::GenBank object and
> saved the altered file?  This would be very interesting for my plasmid
> database.

As described above, Bio::Sequence#output can be used. The method returns
the formatted text as a String, which you can easily write to a file.

> Thanks for the help.
> JO


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org