[Biojava-l] BLAST parsing explodes in size
Keith James
kdj at sanger.ac.uk
Tue Nov 11 11:21:52 EST 2003
>>>>> "FV" == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:
FV> Hi, I am having a problem parsing huge blast
FV> results. Basically I am parsing the blast results pretty much
FV> the same way as in "Biojava in Anger", with as only difference
FV> that I use the setModeLazy() of the BlastLikeSAXParser, since
FV> I am using NCBI Blast version 2.2.4 and that version is not
FV> recognised by the parser yet.
Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only
minor whitespace changes in the format.
FV> Besides that the only difference lays in the things I do with
FV> the data.
This is likely to be the cause of the problem. See below.
FV> The problem is that when I parse a blast result that is a few
FV> hundred MB, for example 300MB, the java application is
FV> ballooning up to around 1.6GB of memory. Sometimes the
FV> application even crashes because I only have got 2GB to play
FV> with.
The parser uses an event-driven framework which is designed to handle
very large inputs - it will handle multi-GB reports. However, if you
create many fine-grained objects for every element of every report and
keep references to them, you will quickly run out of memory.
FV> Does anyone know what's causing this? Is it because I set the
FV> lazy mode? Is there any way to work around it?
You have two options. Either work out which elements of the report
you are interested in and build a filter which captures those events,
discarding the rest (see the demos/ssbind package for an example by
Matthew), or, if you really need all those objects, make sure they
can be garbage-collected as soon as possible.
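
As a rough illustration of the filtering idea, here is a minimal sketch
using only the SAX classes that ship with the JDK. It is not the ssbind
code and does not use the BioJava parser: the class name HitIdFilter,
the element name "Hit" and the attribute "id" are made up for the
example, and the input is a small XML string rather than a BLAST
report. The point is only that the handler keeps one short string per
hit and lets every other event pass by without building any objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical filter: records the "id" attribute of each "Hit" element
// and ignores all other events. Element and attribute names are
// assumptions for this sketch, not BioJava's actual event names.
public class HitIdFilter extends DefaultHandler {
    private final List<String> hitIds = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        // Capture only the element we care about.
        if ("Hit".equals(qName)) {
            hitIds.add(attrs.getValue("id"));
        }
        // characters() and endElement() are inherited as no-ops, so
        // alignment text and other detail is never stored.
    }

    public List<String> getHitIds() {
        return hitIds;
    }

    // Parses the given XML text and returns the captured hit ids.
    public static List<String> collectHitIds(String xml) throws Exception {
        HitIdFilter handler = new HitIdFilter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getHitIds();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Report>"
            + "<Hit id=\"a\"><Alignment>ACGT</Alignment></Hit>"
            + "<Hit id=\"b\"/>"
            + "</Report>";
        System.out.println(collectHitIds(xml));
    }
}
```

If you do need richer objects per hit, the same shape still applies:
build them inside the handler, process each one when its element ends,
and drop the reference rather than adding it to a long-lived list.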
It is possible that there is a bug somewhere, but without seeing
any code it isn't possible to say much more. If you need more help,
post a short (working) piece of code illustrating the problem and we
will do our best.
hth
Keith
--
- Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -