[BioRuby] Parsing large blastout.xml files
    Rob Syme 
    rob.syme at gmail.com
       
    Fri Nov  6 02:55:45 UTC 2009
    
    
  
I'm trying to extract information from a large BLAST XML file. To parse the
XML file, Ruby reads the whole file into memory before looking at each
entry. For large files (around 2.5 GB), the memory requirements become severe.
My first approach was to split each query into its own <BlastOutput> XML
instance, so that
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
would end up looking more like:
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit></Hit>
        <Hit></Hit>
        <Hit></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
BioRuby has trouble parsing that concatenated form, so each <BlastOutput> had
to be given its own file:
$ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
Now each file can be parsed individually. I feel like there has to be an
easier way. Is there a way to parse large XML files without huge memory
overheads, or is that just par for the course?
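
For illustration, here is a rough sketch of the kind of per-query splitting
described above, done line by line in plain Ruby so that only one <Iteration>
is held in memory at a time. It is not BioRuby API: the method name
each_iteration_xml is made up for this example, and it assumes that every
<Iteration> and </Iteration> tag sits on its own line, as in the layout shown
earlier.

```ruby
# Hypothetical helper (not part of BioRuby): yields each
# <Iteration>...</Iteration> block as a String while reading the input
# one line at a time, so the whole file is never loaded at once.
# Assumption: the opening and closing Iteration tags are each on their own line.
def each_iteration_xml(io)
  return enum_for(:each_iteration_xml, io) unless block_given?

  buffer = nil
  io.each_line do |line|
    # Start collecting when an opening tag is seen. '<Iteration>' does not
    # match '<Iteration_hits>' because of the closing angle bracket.
    buffer = String.new if line.include?('<Iteration>')
    buffer << line if buffer

    # Hand the finished block to the caller and drop it from memory.
    if buffer && line.include?('</Iteration>')
      yield buffer
      buffer = nil
    end
  end
end

# Example use (file name taken from the csplit command above):
#   File.open('segmented_blastout.xml') do |f|
#     each_iteration_xml(f) { |xml| puts xml.scan('<Hit>').size }
#   end
```

Each yielded chunk is only a fragment; it would still need to be wrapped in
whatever <BlastOutput> header the parser expects before being handed over.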
    
    