[BioRuby] Parsing large blastout.xml files

Sat Nov 7 07:42:44 UTC 2009

I did the same a while back using XML::Twig:

  http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml
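
If you would rather stay in Ruby, a streaming (SAX-style) parser should avoid
building the whole tree in memory. Here is a minimal sketch using REXML's
stream parser, which ships with Ruby. It is not what blast_split_xml does.
IterationListener and each_iteration are names made up for illustration, and
the element names Iteration_query-def and Hit_id are assumed from NCBI's BLAST
XML output, so check them against your own file.

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: builds one small summary per <Iteration> and
# hands it to a block, so only the current iteration is kept around.
class IterationListener
  include REXML::StreamListener

  def initialize(&on_iteration)
    @on_iteration = on_iteration
    @path = []      # stack of currently open element names
    @current = nil  # summary of the iteration being read, if any
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @current = { query: String.new, hits: [] } if name == 'Iteration'
    # Assumed element name: each <Hit_id> starts a new hit entry.
    @current[:hits] << String.new if name == 'Hit_id' && @current
  end

  def text(data)
    return unless @current
    # Text may arrive in several chunks, so append rather than assign.
    case @path.last
    when 'Iteration_query-def' then @current[:query] << data
    when 'Hit_id'              then @current[:hits].last << data
    end
  end

  def tag_end(name)
    @path.pop
    return unless name == 'Iteration'
    @on_iteration.call(@current)
    @current = nil  # drop the finished iteration
  end
end

# Hypothetical helper: yields one summary hash per <Iteration> in io.
def each_iteration(io, &block)
  listener = IterationListener.new(&block)
  REXML::Parsers::StreamParser.new(io, listener).parse
end
```

You would call it with an open file, e.g. File.open(path) { |f|
each_iteration(f) { |it| puts it[:query] } }. I have not measured this on a
2.5 GB file; REXML is slow, so a libxml-based reader may be a better fit if
speed matters.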

On Fri, Nov 06, 2009 at 10:55:45AM +0800, Rob Syme wrote:
> I'm trying to extract information from a large BLAST XML file. To parse the
> XML file, Ruby reads the whole file into memory before looking at each
> entry. For large files (around 2.5 GB), the memory requirements become severe.
> 
> My first approach was to split each query up into its own <BlastOutput> XML
> instance, so that
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> would end up looking more like:
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> <BlastOutput>
>   <BlastOutput_iterations>
>     <Iteration>
>       <Iteration_hits>
>         <Hit></Hit>
>         <Hit></Hit>
>         <Hit></Hit>
>       </Iteration_hits>
>     </Iteration>
>   </BlastOutput_iterations>
> </BlastOutput>
> 
> BioRuby has trouble parsing that concatenated output, so each <BlastOutput>
> had to be given its own file:
> 
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
> 
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?