From Yannick.Wurm at unil.ch Sun Sep 24 09:28:53 2006
From: Yannick.Wurm at unil.ch (Yannick Wurm)
Date: Sun, 24 Sep 2006 15:28:53 +0200
Subject: [BioRuby] Blast parsing speed
Message-ID: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>

Hi,

I have been happily using bioruby for the past year or so for my post-blast analyses. Occasionally, I have ~1 GB blast result files that need to be parsed. When that happens, my machine may start paging and slow to a crawl.

Thus I wonder:
- Has anyone benchmarked bioruby, bioperl, biojava and biopython when processing the same file to compare speed and memory usage?
- For the sake of future compatibility, I have been using blast's xml output. How much slower is it to parse such an xml file relative to a "normal" or tabular blast output?

Cheers,

Yannick

--------------------------------------------
yannick . wurm @ unil . ch
Ant Genomics, Ecology & Evolution @ Lausanne
http://www.unil.ch/dee/page28685_fr.html

From mmhohman at northwestern.edu Wed Sep 27 02:18:45 2006
From: mmhohman at northwestern.edu (Moses M. Hohman)
Date: Tue, 26 Sep 2006 23:18:45 -0700
Subject: [BioRuby] Blast parsing speed
In-Reply-To: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>
References: <4A15B31A-2924-42D1-BC4B-354145686CCB@unil.ch>
Message-ID: <0700FC44-29EF-4F77-985A-C5D2841ABF4D@northwestern.edu>

Hi Yannick,

Sounds like bioruby is reading the entire DOM tree of the blast output XML into memory (hence all the paging, etc.). That looks like what's happening in bio/appl/blast/rexml.rb. It looks like if you have the xmlparser library installed (http://raa.ruby-lang.org/project/xmlparser/), which is a SAX parser, it will use that, and that should solve your problem.

We might want to look into using a pull parser instead of a DOM parser, i.e. in Ruby use rexml/parsers/pullparser instead of rexml/document.
Pull parsers are nice because they are as memory-efficient as SAX parsers but allow you to use a more familiar procedural programming style rather than an event-driven style (like in xmlparser).

So, it's less an issue of the programming language, and more of the type of XML parser.

Hope that helps. It's a guess, but I think it's probably what you're encountering.

Moses
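
To make the pull-parser suggestion concrete, here is a minimal sketch using rexml/parsers/pullparser. It is not how BioRuby itself parses BLAST reports. The helper name each_hit_def and the sample document are hypothetical, made up for illustration; Hit_def is the element BLAST's XML output uses for a hit's definition line. The loop asks the parser for one event at a time and keeps only the text of the element it is currently inside, so it never builds a tree of the whole document.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper (not part of BioRuby): walk a BLAST XML stream with
# REXML's pull parser and yield the text of each <Hit_def> element.
def each_hit_def(io)
  parser = REXML::Parsers::PullParser.new(io)
  in_hit_def = false
  buffer = String.new

  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == 'Hit_def'
      # Entering a hit definition: start collecting its text.
      in_hit_def = true
      buffer = String.new
    elsif event.text? && in_hit_def
      # Text may arrive in more than one event, so append rather than assign.
      buffer << event[0]
    elsif event.end_element? && event[0] == 'Hit_def'
      # Leaving the element: hand the collected text to the caller.
      in_hit_def = false
      yield buffer
    end
  end
end

# Tiny made-up document with the same nesting as BLAST XML output.
SAMPLE = <<~XML
  <BlastOutput>
    <BlastOutput_iterations>
      <Iteration>
        <Iteration_hits>
          <Hit><Hit_def>first hit</Hit_def></Hit>
          <Hit><Hit_def>second hit</Hit_def></Hit>
        </Iteration_hits>
      </Iteration>
    </BlastOutput_iterations>
  </BlastOutput>
XML

each_hit_def(StringIO.new(SAMPLE)) { |definition| puts definition }
```

For a real report you would pass an open File instead of the StringIO. Whether this is fast enough on a ~1 GB file is something to measure rather than assume; the sketch only shows the procedural style a pull parser allows.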