From Yannick.Wurm at unil.ch  Sun Sep 24 09:28:53 2006
From: Yannick.Wurm at unil.ch (Yannick Wurm)
Date: Sun, 24 Sep 2006 15:28:53 +0200
Subject: [BioRuby] Blast parsing speed
Message-ID: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>

Hi,
I have been happily using bioruby for the past year or so for my post- 
blast analyses. Occasionally, I will have ~1 GB blast result files  
that need to be parsed. When that happens, my machine may start paging  
and slow to a crawl.

Thus I wonder:
	- has anyone benchmarked bioruby, bioperl, biojava, biopython when  
processing the same file to compare speed and memory usage?
	- For the sake of future compatibility, I have been using blast's xml  
output. How much slower is it to parse such an xml file relative  
to a "normal" or tabular blast output?

Cheers,

Yannick

--------------------------------------------
          yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
   http://www.unil.ch/dee/page28685_fr.html


From mmhohman at northwestern.edu  Wed Sep 27 02:18:45 2006
From: mmhohman at northwestern.edu (Moses M. Hohman)
Date: Tue, 26 Sep 2006 23:18:45 -0700
Subject: [BioRuby] Blast parsing speed
In-Reply-To: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>
References: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>
Message-ID: <0700FC44-29EF-4F77-985A-C5D2841ABF4D@northwestern.edu>

Hi Yannick,

Sounds like bioruby is reading the entire DOM tree of the blast  
output XML into memory (with all the paging, etc.). That looks like  
what's happening in bio/appl/blast/rexml.rb. It looks like if you  
have the xmlparser library installed  
(http://raa.ruby-lang.org/project/xmlparser/), which is a SAX parser,  
bioruby will use that instead, and that should solve your problem.

We might want to look into using a pull parser instead of a DOM  
parser, i.e. in Ruby use rexml/parsers/pullparser instead of  
rexml/document. Pull parsers are nice because they are as  
memory-efficient as SAX parsers but allow you to use a more familiar  
procedural programming style rather than an event-driven style (like  
in xmlparser).
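
To make the pull-parser idea concrete, here is a rough sketch of how it  
might look. This is not what bioruby does today: each_hit is a  
hypothetical helper name, and it only collects three fields (Hit_id,  
Hit_def, Hit_len) from each <Hit> element of the blast XML. Text is taken  
as the raw string, without decoding XML entities.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper (not part of bioruby): yields one Hash per <Hit>
# element, so only the hit currently being read is held in memory.
def each_hit(io)
  wanted = %w[Hit_id Hit_def Hit_len]
  parser = REXML::Parsers::PullParser.new(io)
  hit = nil    # Hash for the <Hit> being read, nil when outside a hit
  field = nil  # name of the child element whose text is being collected

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      name = event[0]
      if name == 'Hit'
        hit = {}
      elsif hit && wanted.include?(name)
        field = name
        hit[field] = String.new
      end
    elsif event.text?
      # Raw text as it appears in the file; ignored outside wanted fields.
      hit[field] << event[0] if field
    elsif event.end_element?
      name = event[0]
      if name == field
        field = nil
      elsif name == 'Hit'
        yield hit
        hit = nil
      end
    end
  end
end

# Small in-memory sample; for a real file, pass File.open(path) instead.
sample = <<XML
<BlastOutput>
  <Hit><Hit_id>gi|1</Hit_id><Hit_def>first hit</Hit_def><Hit_len>120</Hit_len></Hit>
  <Hit><Hit_id>gi|2</Hit_id><Hit_def>second hit</Hit_def><Hit_len>87</Hit_len></Hit>
</BlastOutput>
XML

each_hit(StringIO.new(sample)) do |h|
  puts "#{h['Hit_id']}\t#{h['Hit_len']}\t#{h['Hit_def']}"
end
```

The loop reads like ordinary procedural code, and because each hit is  
handed to the block and then dropped, memory use should stay roughly  
flat however large the input file is.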

So, it's less an issue of the programming language, and more of the  
type of XML parser.

Hope that helps, it's a guess but I think it's probably what you're  
encountering,

Moses

On Sep 24, 2006, at 6:28 AM, Yannick Wurm wrote:

> Hi,
> I have been happily using bioruby for the past year or so for my post-
> blast analyses. Occasionally, I will have ~ 1gb blast result files
> that need to be parsed. Here my machine may start paging and slows to
> a crawl.
>
> Thus I wonder:
> 	- has anyone benchmarked bioruby, bioperl, biojava, biopython when
> processing the same file to compare speed and memory usage?
> 	- For the sake of future compatibility, I have been use blast's xml
> output. How much slower is it is to parse such an xml file relative
> to a "normal" or tabular blast output?
>
> Cheers,
>
> Yannick
>
> --------------------------------------------
>           yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
>    http://www.unil.ch/dee/page28685_fr.html
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

