[BioRuby] Parsing large blastout.xml files
Adam
adamnkraut at gmail.com
Fri Nov 6 03:17:02 UTC 2009
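You might want to try a SAX parser instead. It hands you each element as an
event while the file is being read, so the whole document never has to sit in
memory. REXML from the standard library has a streaming API. LibXML is a lot
faster and it's available as a gem (libxml-ruby).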
http://libxml.rubyforge.org/
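
Here is a minimal sketch of the streaming approach with REXML, just to show
the shape of it. The HitCounter class and the count_hits helper are names I
made up for illustration; they are not part of BioRuby or REXML. It only counts
<Hit> elements per <Iteration>, assuming the element names from your example,
so you would swap in whatever extraction you actually need.

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: counts <Hit> elements inside each <Iteration>
# as the parser streams through the input. No document tree is built.
class HitCounter
  include REXML::StreamListener # supplies no-op defaults for unused callbacks

  attr_reader :hits_per_iteration

  def initialize
    @hits_per_iteration = []
  end

  # Called by the parser for every opening tag.
  def tag_start(name, _attrs)
    case name
    when 'Iteration'
      @hits_per_iteration << 0
    when 'Hit'
      @hits_per_iteration[-1] += 1 unless @hits_per_iteration.empty?
    end
  end
end

# Hypothetical helper: accepts any IO-like object (File, StringIO) or a String.
def count_hits(io)
  listener = HitCounter.new
  REXML::Parsers::StreamParser.new(io, listener).parse
  listener.hits_per_iteration
end
```

With a real file you would call it as
File.open('blastout.xml') { |f| p count_hits(f) }
where blastout.xml stands in for your own file name. The same listener idea
carries over to LibXML's SAX parser if REXML turns out to be too slow.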
On Thu, Nov 5, 2009 at 9:55 PM, Rob Syme <rob.syme at gmail.com> wrote:
> I'm trying to extract information from a large BLAST XML file. To parse the
> XML file, Ruby reads the whole file into memory before looking at each
> entry. For large files (around 2.5 GB), the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> Would end up looking more like:
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> BioRuby has trouble parsing that, so each <BlastOutput> had to be given
> its own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?