[Biojava-dev] [BioJava - Bug #3345] (Closed) Static object cache in SimpleRichObjectBuilder causing memory leak
Andreas Prlic
andreas at sdsc.edu
Fri May 11 23:05:57 UTC 2012
Hi,
We really need somebody to write a new genebank parser for the biojava
3 world.... Any volunteers? :-)
Andreas
On Fri, May 11, 2012 at 2:16 PM, George Waldon <gwaldon at geneinfinity.org> wrote:
> Tjeerd,
>
> It is not a bug, it is a design for a singleton map to interact with a
> database to actually address the issue you had in the first place. If you
> change it, hibernate sessions may not work properly or memory issues may
> arise if too many sequences are loaded at the same time because the current
> code also prevent duplicate objects.
>
> What about submitting your own version of RichObjectBuilder with a clear
> description of when and how to use it?
>
>
> - George
>
> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>
>> Hi,
>>
>> I've fixed the memory issue for me personally, but it may be a problem for
>> future users, hence the question.
>>
>> As far I can see changing the default caching strategy will not break any
>> contracts. It simply means taking a small (negligible?) performance hit in
>> order to be able to run in a bounded memory space. Would you agree?
>>
>> Tjeerd
>>
>> On 5/11/2012 5:34 PM, George Waldon wrote:
>>>
>>> Hi Tjeerd,
>>>
>>> Bj1x was designed and written by Richard Holland; that would be the
>>> person to ask directly I guess:-).
>>>
>>> It is not desirable to change behavior retroactively. Are you still
>>> facing memory issues? Could you be more explicit on your problem?
>>>
>>> There are several ways we could adjust the current code without breaking
>>> the original design; e.g. use of service providers or implement a
>>> getRichObjectBuilder method to swap the builder for a particular job. Just
>>> thoughts.
>>>
>>> - George
>>>
>>>
>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>
>>>> Any thoughts on this from the Biojava authors? I strongly believe that
>>>> the default behavior of BioJava should not be "unbounded memory usage unless
>>>> you implement your own RichObjectBuilder".
>>>>
>>>> - Tjeerd
>>>>
>>>> On 4/27/2012 3:45 PM, George Waldon wrote:
>>>>>
>>>>> You can use RichObjectFactory.setRichObjectBuilder(RichObjectBuilder b)
>>>>> to enable you own builder.
>>>>>
>>>>> - George
>>>>>
>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>
>>>>>> Hi George,
>>>>>>
>>>>>> A very good point. While the specification of the RichObjectBuilder
>>>>>> interface does not *enforce* that implementations persist the objects, it
>>>>>> seems to be the intended use. Both implementations (BioSQLRichObjectBuilder
>>>>>> and SimpleRichObjectBuilder) persist all created objects. Creating my own
>>>>>> class would work for me, but this should be fixed in BioJava codebase as
>>>>>> well, correct? Creating another RichObjectBuilder implementation that works
>>>>>> in finite memory would work, but it would have to be 'plugged in' by editing
>>>>>> the RichObjectFactory or enabled through a configuration option - both seem
>>>>>> undesirable. Any ideas?
>>>>>>
>>>>>> Tjeerd
>>>>>>
>>>>>> On 4/27/2012 6:44 AM, George Waldon wrote:
>>>>>>>
>>>>>>> Hi Tjeerd,
>>>>>>>
>>>>>>> The use of a singleton map for rich objects with a Hibernate database
>>>>>>> is described in the biojavax documentation; see also the class
>>>>>>> BioSQLRichObjectBuilder.
>>>>>>>
>>>>>>> You can implement your solution simply by creating your own
>>>>>>> implementation of RichObjectBuilder as I mentioned before. Maybe you could
>>>>>>> store the cache values as WeakReference and recreated objects only when
>>>>>>> needed. Or you could filter the list of parameters, e.g. in the case of
>>>>>>> SimpleDocRef return the same singleton representing an empty DocRef if you
>>>>>>> are not interested in those, etc.
>>>>>>>
>>>>>>> - George
>>>>>>>
>>>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> I cannot reopen the bug myself, but I'm sure Andreas or another
>>>>>>>> manager of the bugtracker can.
>>>>>>>>
>>>>>>>> For my purposes the memory usage could be kept in check by disabling
>>>>>>>> the cache in SimpleRichObjectBuilder. It's hard to say what the performance
>>>>>>>> hit is, but it's not been noticeable in my application. A permanent solution
>>>>>>>> needs some more thought.
>>>>>>>>
>>>>>>>> I double-checked, and all calls to SimpleObjectBuilder.getObject()
>>>>>>>> go through RichObjectFactory.getObject() - it effectively delegates the call
>>>>>>>> to the SimpleObjectBuilder. The SimpleObjectBuilder class keeps a cache of
>>>>>>>> unrestricted size, with one object for every unique combination of class and
>>>>>>>> parameters. The RichObjectFactory class also keeps a cache of objects, but
>>>>>>>> this cache is restricted to containing instances for at most 20 different
>>>>>>>> object classes at any given time.
>>>>>>>> So we can think of this as a level 1 cache in RichObjectFactory, and
>>>>>>>> a level 2 cache in SimpleObjectBuilder. The level 1 cache may contain at
>>>>>>>> most 20 entries, and the level 2 cache may contain at most the amount of
>>>>>>>> instantiable classes in BioJava, which will be a couple hundred *maximum*. I
>>>>>>>> don't think this difference justifies the use of two caches.
>>>>>>>>
>>>>>>>> Possible fix:
>>>>>>>> Unless I missed something, removing the cache from
>>>>>>>> SimpleObjectBuilder and increasing the size of the LRU cache in
>>>>>>>> RichObjectFactory to compensate seems like a good option. This would solve
>>>>>>>> the unbounded growth of the level 2 cache by eliminating it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> There is another issue with the level 1 cache that could cause
>>>>>>>> unbounded growth. Look at this piece of code in
>>>>>>>> RichObjectFactory.getObject():
>>>>>>>>
>>>>>>>> Map m = (Map)cache.get(applicationClass);
>>>>>>>> if (!m.containsKey(paramsList)) {
>>>>>>>> m.put(paramsList,builder.buildObject(applicationClass,
>>>>>>>> paramsList));
>>>>>>>> }
>>>>>>>>
>>>>>>>> For every unique combination of parameters, the object is built and
>>>>>>>> stored in the Map associated with that class. This Map is not a LRU
>>>>>>>> implementation, so creating 10,000,000 instances of a class (i.e.
>>>>>>>> SimpleDocRef) with different parameters would cause every single instance to
>>>>>>>> be saved in the cache. Such a situation would again cause memory problems.
>>>>>>>>
>>>>>>>> Possible fix:
>>>>>>>> Well that's an easy one! Replace this simple Map cache with a LRU
>>>>>>>> implementation, as proposed before.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tjeerd
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/27/2012 12:03 AM, George Waldon wrote:
>>>>>>>>>
>>>>>>>>> Hi Tjeerd,
>>>>>>>>>
>>>>>>>>> You can certainly reopen a bug.
>>>>>>>>>
>>>>>>>>> I am not the expert here, but looking at the code,
>>>>>>>>> RichObjectFactory uses a LRU cache as a temporary cache for objects that are
>>>>>>>>> reused often while SimpleRichObjectBuilder caches everything. I doubt you
>>>>>>>>> can remove this last one.
>>>>>>>>>
>>>>>>>>> Maybe you can write your own RichObjectBuilder that would cache
>>>>>>>>> only things you are interested in (to compare?). I bet you generate a lot of
>>>>>>>>> DocRef that do not need to be cached; then use
>>>>>>>>> RichObjectFactory.setRichObjectBuilder to set your partial builder before
>>>>>>>>> you parse. Just an idea.
>>>>>>>>>
>>>>>>>>> Hope this helps,
>>>>>>>>> George
>>>>>>>>>
>>>>>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>>>>>
>>>>>>>>>> Wow. I'm quite emberassed, but the issue is actually not fixed in
>>>>>>>>>> 1.8.2. The RichObjectFactory class uses an LRU cache, while the
>>>>>>>>>> SimpleRichObjectBuilder class uses the cache described in my earlier post,
>>>>>>>>>> and I mixed up the two. Sorry about that, I must have been staring at this
>>>>>>>>>> code for too long today.
>>>>>>>>>>
>>>>>>>>>> It turns out the RichObjectFactory class is the only user of
>>>>>>>>>> SimpleRichObjectBuilder, and has its own LRU cache. So the cache in
>>>>>>>>>> SimpleRichObjectBuilder can probably be removed altogether. If the experts
>>>>>>>>>> agree I would be happy to write a patch for this.
>>>>>>>>>>
>>>>>>>>>> Can the old bug report be reopened, or should I file a new bug
>>>>>>>>>> report? Or should I submit a possible patch through this mailing list?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Tjeerd
>>>>>>>>>>
>>>>>>>>>> On 4/26/2012 6:01 PM, redmine at redmine.open-bio.org wrote:
>>>>>>>>>>>
>>>>>>>>>>> Issue #3345 has been updated by Andreas Prlic.
>>>>>>>>>>>
>>>>>>>>>>> * Status changed from New to Closed
>>>>>>>>>>> * % Done changed from 0 to 100
>>>>>>>>>>>
>>>>>>>>>>> already fixed in 1.8.2
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> Bug #3345: Static object cache in SimpleRichObjectBuilder causing
>>>>>>>>>>> memory leak <https://redmine.open-bio.org/issues/3345>
>>>>>>>>>>>
>>>>>>>>>>> * Author: Tjeerd Boerman
>>>>>>>>>>> * Status: Closed
>>>>>>>>>>> * Priority: Normal
>>>>>>>>>>> * Assignee: biojava-dev list
>>>>>>>>>>> * Category: bio
>>>>>>>>>>> * Target version: BioJava 1.8 - legacy
>>>>>>>>>>> * URL:
>>>>>>>>>>>
>>>>>>>>>>> I encountered a memory problem when parsing many Genbank files
>>>>>>>>>>> with the Biojava 1.8.1. The parsed files were protein GPFF (GenPept Flat
>>>>>>>>>>> File format) files from the latest RefSeq release. The application tried to
>>>>>>>>>>> parse millions of protein sequences from these files, but an
>>>>>>>>>>> OutOfMemoryException would always occur after some time. The used heap space
>>>>>>>>>>> would gradually increase from a couple hundred megabytes to over 1.5 GB,
>>>>>>>>>>> until the heap could grow no further. Upon inspection I discovered a HashMap
>>>>>>>>>>> in RichSequenceBuilder was the culprit:
>>>>>>>>>>>
>>>>>>>>>>> public class SimpleRichObjectBuilder implements RichObjectBuilder
>>>>>>>>>>> {
>>>>>>>>>>>
>>>>>>>>>>> private static Map objects = new HashMap();
>>>>>>>>>>>
>>>>>>>>>>> public Object buildObject(Class clazz, List paramsList) {
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> // return the constructed object from the hashmap if there
>>>>>>>>>>> already
>>>>>>>>>>> if (contents.containsKey(ourParamsList)) return
>>>>>>>>>>> contents.get(ourParamsList);
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> // Instantiate it with the parameters
>>>>>>>>>>> Object o = c.newInstance(ourParamsList.toArray());
>>>>>>>>>>>
>>>>>>>>>>> // store it for later in the singleton map
>>>>>>>>>>> contents.put(ourParamsList, o);
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> It seems the *objects* Map in SimpleRichSequenceBuilder is used
>>>>>>>>>>> as a static cache for objects, but when many different objects are created
>>>>>>>>>>> this cache grows out of control. I am unsure if this is a 'true' bug, but
>>>>>>>>>>> for my application it was a definite problem. My fix was to simply comment
>>>>>>>>>>> out the *contents.put()* statement, but I'm sure there is a better way to
>>>>>>>>>>> resolve this - perhaps by making the use of the cache optional through a
>>>>>>>>>>> configuration option.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------ You
>>>>>>>>>>> have received this notification because you have either subscribed to it, or
>>>>>>>>>>> are involved in it.
>>>>>>>>>>> To change your notification preferences, please click here and
>>>>>>>>>>> login: http://redmine.open-bio.org
>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> biojava-dev mailing list
>>>>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------------
>>>>>>>>> George Waldon
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
More information about the biojava-dev
mailing list