[BioRuby] Blast parsing speed
Moses M. Hohman
mmhohman at northwestern.edu
Wed Sep 27 06:18:45 UTC 2006
Hi Yannick,
Sounds like BioRuby is reading the entire DOM tree of the BLAST
output XML into memory, which would explain all the paging. That
looks like what's happening in bio/appl/blast/rexml.rb. If you have
the xmlparser library installed
(http://raa.ruby-lang.org/project/xmlparser/), which is a SAX parser,
BioRuby will use that instead, and that should solve your problem.
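
To illustrate the event-driven style, here is a minimal sketch that
counts <Hit> elements without ever building a tree. It uses REXML's own
stream listener as a stand-in, not xmlparser's API, and the HitCounter
class and count_hits method are hypothetical names made up for this
example:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: the parser calls tag_start once per opening
# tag, so only a running count is held in memory, not the document.
class HitCounter
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'Hit'
  end
end

# Hypothetical helper: stream the XML from an IO (or a String).
def count_hits(io)
  listener = HitCounter.new
  REXML::Document.parse_stream(io, listener)
  listener.count
end
```

Something like File.open('result.xml') { |f| count_hits(f) } would then
read the file incrementally (the file name is only a placeholder).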
We might want to look into using a pull parser instead of a DOM
parser, i.e. in Ruby use rexml/parsers/pullparser instead of
rexml/document. Pull parsers are nice because they are as
memory-efficient as SAX parsers but let you write in a more familiar
procedural style rather than an event-driven style (like in
xmlparser).
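
A rough sketch of the pull style, assuming we only want the text of
each <Hit_id> element. The hit_ids method is a hypothetical name for
this example and this is not how BioRuby currently does it:

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: collect the text of every <Hit_id> element.
# We ask the parser for one event at a time in an ordinary loop, so
# no DOM tree is ever built.
def hit_ids(io)
  parser = REXML::Parsers::PullParser.new(io)
  ids = []
  current = nil # text buffer, non-nil only while inside <Hit_id>

  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == 'Hit_id'
      current = String.new
    elsif event.text? && current
      current << event[0] # raw text, entities not expanded
    elsif event.end_element? && event[0] == 'Hit_id'
      ids << current
      current = nil
    end
  end
  ids
end
```

The control flow stays in our own loop rather than in callbacks, which
is the main appeal over the SAX style.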
So, it's less an issue of the programming language, and more of the
type of XML parser.
Hope that helps. It's a guess, but I think it's probably what you're
encountering.
Moses
On Sep 24, 2006, at 6:28 AM, Yannick Wurm wrote:
> Hi,
> I have been happily using bioruby for the past year or so for my
> post-blast analyses. Occasionally, I will have ~1 GB blast result
> files that need to be parsed. Here my machine may start paging and
> slow to a crawl.
>
> Thus I wonder:
> - has anyone benchmarked bioruby, bioperl, biojava, biopython when
> processing the same file to compare speed and memory usage?
> - For the sake of future compatibility, I have been using blast's xml
> output. How much slower is it to parse such an xml file relative
> to a "normal" or tabular blast output?
>
> Cheers,
>
> Yannick
>
> --------------------------------------------
> yannick . wurm @ unil . ch
> Ant Genomics, Ecology & Evolution @ Lausanne
> http://www.unil.ch/dee/page28685_fr.html