[BioRuby] Parsing large blastout.xml files
Rob Syme
rob.syme at gmail.com
Fri Nov 6 02:55:45 UTC 2009
I'm trying to extract information from a large BLAST XML file. To parse it,
Ruby reads the whole file into memory before looking at each entry. For large
files (2.5GB-ish), the memory requirements become severe.
My first approach was to split each query up into its own <BlastOutput> xml
instance, so that
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
would end up looking more like:
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
BioRuby has trouble parsing this concatenated form, so each <BlastOutput> had
to be given its own file:
$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
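Now each file can be parsed individually, along these lines (a rough sketch;
the tinyxml_ glob matches the csplit prefix above, and the fields printed are
just illustrative):

require 'bio'

# Parse each per-query file on its own, so only one small
# document is in memory at a time.
Dir.glob("tinyxml_*").sort.each do |path|
  report = Bio::Blast::Report.new(File.read(path))
  report.iterations.each do |iteration|
    iteration.hits.each do |hit|
      puts "#{hit.definition}\t#{hit.evalue}"
    end
  end
end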
I feel like there has to be an easier way, though. Is there a way to parse
large XML files without huge memory overheads, or is that just par for the
course?
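The kind of thing I'm imagining is a stream (SAX-style) parser that fires
callbacks per element instead of building the whole document tree, so memory
use stays flat regardless of file size. A rough sketch with REXML's stream
parser from the standard library; the listener below just prints each
<Hit_def> and is purely illustrative, not BioRuby API:

require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Listener that buffers the text of each <Hit_def> element and
# prints it as soon as the closing tag is seen.
class HitListener
  include REXML::StreamListener

  def tag_start(name, _attrs)
    @buffer = "" if name == "Hit_def"
  end

  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    if name == "Hit_def" && @buffer
      puts @buffer
      @buffer = nil
    end
  end
end

File.open("blastout.xml") do |f|
  REXML::Parsers::StreamParser.new(f, HitListener.new).parse
end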