[BioRuby] Parsing large blastout.xml files
Pjotr Prins
pjotr.public14 at thebird.nl
Sat Nov 7 07:42:44 UTC 2009
I did the same a while back using XML::Twig (Perl):
http://github.com/pjotrp/biotools/blob/master/bin/blast_split_xml
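
If you would rather stay in Ruby, an event-based (SAX-style) parser avoids building the whole tree in memory. Below is a minimal sketch using REXML's stream listener from the standard library. It is not what the script above does, and the names HitCounter and hits_per_iteration are made up for illustration. It only counts the <Hit> elements in each <Iteration>; a real handler would collect the fields it needs in the text callback.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts <Hit> elements per <Iteration>
# as parser events arrive, without building a DOM tree.
class HitCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = []   # one entry per <Iteration>, in file order
    @current = 0   # hits seen so far in the open <Iteration>
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    @current = 0 if name == 'Iteration'
    @current += 1 if name == 'Hit'
  end

  # Called for every closing tag.
  def tag_end(name)
    @counts << @current if name == 'Iteration'
  end
end

# Accepts an open File (or any IO) or a String containing the XML.
def hits_per_iteration(source)
  listener = HitCounter.new
  REXML::Document.parse_stream(source, listener)
  listener.counts
end
```

Usage would be something like File.open('blastout.xml') { |f| p hits_per_iteration(f) }. Because the listener keeps only a counter per iteration, memory use should not grow with the number of hits, though I have not measured this on a 2.5 GB file.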
On Fri, Nov 06, 2009 at 10:55:45AM +0800, Rob Syme wrote:
> I'm trying to extract information from a large BLAST XML file. To parse the
> XML file, Ruby reads the whole file into memory before looking at each
> entry. For large files (around 2.5 GB), the memory requirements become severe.
>
> My first approach was to split each query up into its own <BlastOutput> xml
> instance, so that
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> Would end up looking more like:
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> <BlastOutput>
> <BlastOutput_iterations>
> <Iteration>
> <Iteration_hits>
> <Hit></Hit>
> <Hit></Hit>
> <Hit></Hit>
> </Iteration_hits>
> </Iteration>
> </BlastOutput_iterations>
> </BlastOutput>
>
> BioRuby has trouble parsing that, so each <BlastOutput> had to be given
> its own file:
>
> $ csplit --prefix="tinyxml_" segmented_blastout.xml '/<\?xml/' '{*}'
>
> Now each file can be parsed individually. I feel like there has to be an
> easier way. Is there a way to parse large xml files without huge memory
> overheads, or is that just par for the course?