[Biojava-l] Parsing massive blast-like output (was... Problems
with SAX parsing)
Matthew Pocock
matthew_pocock at yahoo.co.uk
Fri Feb 14 21:06:03 EST 2003
Great. Thanks Simon. I hate tracking down these reference leaks.
Matthew
Simon Brocklehurst wrote:
> Re: Parsing massive Blast output
>
> Regarding recent mail to the list (and mail from up to a couple of years ago):
>
> Until now, when attempting to parse *very* large blast outputs
> consisting of many (thousands of) separate reports concatenated
> together, the Java Virtual Machine could sometimes run out of memory.
> The workaround people have been using was to split their output into
> smaller chunks that the parser could handle.
>
> This parsing problem was due to a small bug, which we've now (I
> think/hope) fixed in the biojava cvs (biojava-live).
>
> The parser should now deal successfully with arbitrarily large amounts
> of data, without any need for chunking the output.
>
> After applying this fix, the "BlastLike" SAX parser was tested for
> scalability in terms of handling large numbers of concatenated blast
> reports as follows:
>
> Size measures of typical test input files:
>
> o Tens of thousands of concatenated blast-like reports
>
> o Tens of millions of individual lines of blast-like pairwise output data
>
> o Gigabytes in size
>
> Tests were run using JDK 1.4.1 on Solaris 9. Input data was parsed in
> such a way as to process all SAX events generated by the underlying SAX
> driver.
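The key to the constant memory footprint described above is the streaming SAX model: events are handled and discarded as they arrive, rather than building a document tree. As a minimal sketch of that pattern using only the JDK's own SAX classes (the `report` element name here is a hypothetical stand-in, not the actual vocabulary emitted by the BioJava BlastLike parser):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingSaxDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in input; in practice this would be a stream over
        // gigabytes of parser output.
        String xml = "<reports><report/><report/><report/></reports>";
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                // Each event is processed and dropped immediately, so
                // memory use stays flat however many reports stream past.
                if ("report".equals(qName)) count[0]++;
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")),
                     handler);
        System.out.println("Parsed " + count[0] + " reports");
    }
}
```

A handler written this way does the same work whether it sees three reports or tens of thousands, which is what the scalability tests below exercise.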
>
> o For each test, the outputs from the parser were XML documents each on
> the order of hundreds of millions of lines in size.
>
> o Memory footprint remained both small and constant throughout the
> parsing process, with a typical memory footprint under 14 MB in size.
>
>
> Simon
--
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk