OFBiz
  1. OFBiz
  2. OFBIZ-2186

OutOfMemory provoked by Cache overflow

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: Release Branch 4.0, SVN trunk
    • Component/s: framework
    • Labels:
      None
    • Environment:

      Linux, JDK 1.5_15, Xmx set to 1156Mo

      Description

      In our production system, we had an OutOfMemoryError.
      I analyzed the generated Heap Dump and found the following:
      The cache entitycache.entity-list.default.ProductCategoryMember retains a heap of 369314128 Bytes.
      The LRUMap hold by this object is occupying this space (369314128) and this object has a stange state:

      • maxSize is set to 5000 (as set in the cache.properties)
      • size is 128930 => PROBLEM

      IN cache.properties:
      entitycache.entity-list.default.ProductCategoryMember.expireTime=3600000
      entitycache.entity-list.default.ProductCategoryMember.useSoftReference=true
      entitycache.entity-list.default.ProductCategoryMember.maxInMemory=5000
      entitycache.entity-list.default.ProductCategoryMember.maxSize=7500

      I analyzed the code of LRUMap and its usage in CacheLineTable and IMHO the bug is a missing synchonized in get():
      public CacheLine<V> get(Object key) {
      if (key == null)

      { if (Debug.verboseOn()) Debug.logVerbose("In CacheLineTable tried to get with null key, using NullObject" + this.cacheName, module); }

      return getNoCheck(key);
      }

      Since LRUMap extends LinkedHashMap, if you look at get method, it changes the state of the Map by calling e.recordAccess(this):
      public V get(Object key)

      { Entry<K,V> e = (Entry<K,V>)getEntry(key); if (e == null) return null; e.recordAccess(this); return e.value; }

      So the default of synchronization corrupts the state of LRUMap which grows indefinitely
      I will submit a patch for this on the trunk.
      Philippe
      www.ubik-ingenierie.com

      1. OfbizBug.zip
        241 kB
        Philippe Mouawad
      2. CacheLineTable-patch.txt
        0.6 kB
        Philippe Mouawad

        Issue Links

          Activity

          Hide
          Jacques Le Roux added a comment -

          Thanks Philippe,

          Your patch is in trunk at r882072, R9.04 at r882073.

          I was notsable to merge wih R4.0. If you could provide a patch it would be great...

          Show
          Jacques Le Roux added a comment - Thanks Philippe, Your patch is in trunk at r882072, R9.04 at r882073. I was notsable to merge wih R4.0. If you could provide a patch it would be great...
          Hide
          Philippe Mouawad added a comment -

          Hello M. Le Roux,
          Yes.

          • OFBIZ-2124 was blocking since the application was totally blocked
          • This bug was also critical since the application dumpled every day on the 3 servers

          We are now in production for 20 days with my 2 patches applied and the situation is OK.
          We have been tracking Memory and response Time, and we don't have memory leaks.

          Philippe.

          Show
          Philippe Mouawad added a comment - Hello M. Le Roux, Yes. OFBIZ-2124 was blocking since the application was totally blocked This bug was also critical since the application dumpled every day on the 3 servers We are now in production for 20 days with my 2 patches applied and the situation is OK. We have been tracking Memory and response Time, and we don't have memory leaks. Philippe.
          Hide
          Jacques Le Roux added a comment -

          Hi Philippe,

          When you used this patch have you already applied the patch you submitted in OFBIZ-2124 ?

          Show
          Jacques Le Roux added a comment - Hi Philippe, When you used this patch have you already applied the patch you submitted in OFBIZ-2124 ?
          Hide
          Philippe Mouawad added a comment -

          Hello Again,
          Concerning the Limitation on cache.
          On a big SITE with a rather important catalog, not limiting the cache leads to a lot of OutOfMemory .
          We experienced these problems at first that's why we are limiting more and more caches.
          So you should not think that cache limitation concerns only few caches.

          Philippe.

          Show
          Philippe Mouawad added a comment - Hello Again, Concerning the Limitation on cache. On a big SITE with a rather important catalog, not limiting the cache leads to a lot of OutOfMemory . We experienced these problems at first that's why we are limiting more and more caches. So you should not think that cache limitation concerns only few caches. Philippe.
          Hide
          Philippe Mouawad added a comment -

          Hello M. Jones,
          First thank you for your reactivity.
          Second here is the status about the two previous patches:

          • The first one (you proposed) made the production crash every day for 5 days, an OOM was generated at least once a day on each server
          • With the second one (my proposition) production has been up for 10 days (we did a jmap to verify the leak was solved as I said), furthermore we didn't notice slow response time (if you are worried about the synchronized block). The ecommerce site we are using ofbiz for is a big one with 3 Balanced JVM servers.

          Now about you patch, I am not sure that your patch cannot introduce a Java Deadlock risk since now there are 2 synchronization
          occuring is the case of put/delete (one on the CacheLineTable through synchronized and one on LRUMap through Collections.synchronizedMap wrapping).

          As you may guess we cannot afford testing this new patch with the risk of production crashing again, it is an ECommerce site.
          I am wondering why you don't want to take my patch since it is OK on a big production SITE and does not incur an important performance loss.

          Thanks for your answer.

          Show
          Philippe Mouawad added a comment - Hello M. Jones, First thank you for your reactivity. Second here is the status about the two previous patches: The first one (you proposed) made the production crash every day for 5 days, an OOM was generated at least once a day on each server With the second one (my proposition) production has been up for 10 days (we did a jmap to verify the leak was solved as I said), furthermore we didn't notice slow response time (if you are worried about the synchronized block). The ecommerce site we are using ofbiz for is a big one with 3 Balanced JVM servers. Now about you patch, I am not sure that your patch cannot introduce a Java Deadlock risk since now there are 2 synchronization occuring is the case of put/delete (one on the CacheLineTable through synchronized and one on LRUMap through Collections.synchronizedMap wrapping). As you may guess we cannot afford testing this new patch with the risk of production crashing again, it is an ECommerce site. I am wondering why you don't want to take my patch since it is OK on a big production SITE and does not incur an important performance loss. Thanks for your answer.
          Hide
          David E. Jones added a comment -

          Thank you for the additional information.

          Could you try the changes in SVN rev 751619? With this the LRUMap is made synchronized using the Collections.synchronizedMap method when it is first created, and that should make it thread safe (except for certain iteration related operations, but we aren't doing those with the memoryTable field as far as I can tell).

          This should fix the synchronized get problem, and the separately monitored get vs put/remove problem.

          Show
          David E. Jones added a comment - Thank you for the additional information. Could you try the changes in SVN rev 751619? With this the LRUMap is made synchronized using the Collections.synchronizedMap method when it is first created, and that should make it thread safe (except for certain iteration related operations, but we aren't doing those with the memoryTable field as far as I can tell). This should fix the synchronized get problem, and the separately monitored get vs put/remove problem.
          Hide
          Benoit Wiart added a comment - - edited

          M. Jones,

          CacheLineTable and LRUMap do not synchronize on the same lock

          • CacheLineTable synchronization use the CacheLineTable instance monitor
          • LRUMap synchronization use the LRUMap instance monitor

          the LRUMap (memoryTable) can be updated concurrently by 2 differents threads

          • one in the CacheLineTable #put or CacheLineTable#remove
          • another in the CacheLineTable#get

          this leads to the corruption of the LRUMap which is not thread-safe

          the CacheLineTable#get should be synchronized or the LRUMap should be replaced by a thead safe collection

          Benoit Wiart
          www.ubik-ingenierie.com

          Show
          Benoit Wiart added a comment - - edited M. Jones, CacheLineTable and LRUMap do not synchronize on the same lock CacheLineTable synchronization use the CacheLineTable instance monitor LRUMap synchronization use the LRUMap instance monitor the LRUMap (memoryTable) can be updated concurrently by 2 differents threads one in the CacheLineTable #put or CacheLineTable#remove another in the CacheLineTable#get this leads to the corruption of the LRUMap which is not thread-safe the CacheLineTable#get should be synchronized or the LRUMap should be replaced by a thead safe collection Benoit Wiart www.ubik-ingenierie.com
          Hide
          David E. Jones added a comment -

          What are the 2 different objects? I appreciate the clue, but no I'm not going to hunt it down on my own if you already know the answer...

          Also please feel free to submit a patch, but understand I'll only accept it if it follows the pattern used so far where the functionality for the size limit on the cache will not slow down the cache when it is turned off.

          Show
          David E. Jones added a comment - What are the 2 different objects? I appreciate the clue, but no I'm not going to hunt it down on my own if you already know the answer... Also please feel free to submit a patch, but understand I'll only accept it if it follows the pattern used so far where the functionality for the size limit on the cache will not slow down the cache when it is turned off.
          Hide
          Philippe Mouawad added a comment -

          Hello M. Jones,
          A colleague (Benoit) examined in detail your patch, the problem comes from the fact that the synchronized block is on 2 different objects so synchronization fails.

          Show
          Philippe Mouawad added a comment - Hello M. Jones, A colleague (Benoit) examined in detail your patch, the problem comes from the fact that the synchronized block is on 2 different objects so synchronization fails.
          Hide
          Philippe Mouawad added a comment -

          Hello M. Jones,
          Thank you for taking into account this bug.
          We tested the patch you proposed in production, the same problem occured, the size of LRUMap exceed the configured maxSize.
          We tested the initial patch I proposed and the problem is solved, we did a voluntary heap dump using jmap, and the hprof file generated shows that the size of LRUMap is OK.

          I hope you can take into account this remark as soon as possible.
          Philippe
          www.ubik-ingenierie.com

          Show
          Philippe Mouawad added a comment - Hello M. Jones, Thank you for taking into account this bug. We tested the patch you proposed in production, the same problem occured, the size of LRUMap exceed the configured maxSize. We tested the initial patch I proposed and the problem is solved, we did a voluntary heap dump using jmap, and the hprof file generated shows that the size of LRUMap is OK. I hope you can take into account this remark as soon as possible. Philippe www.ubik-ingenierie.com
          Hide
          David E. Jones added a comment -

          Philippe,

          Please understand that what you're saying now is very different than what you said before. I wasn't questioning your conclusion, I was just responding to your statement that you opinion was that the non-synchronized get was the problem: "IMHO the bug is a missing synchonized in get()".

          I totally agree that if there is code that changes things it needs to be synchronized, and we should look into that more, and how to isolate it so that it is done only when the LRU list is used (which is not the common case, BTW).

          Thank you for providing more detail and being more clear in your additional comments.

          This one was actually pretty easy, so I made some changes in SVN rev 744205 that should resolve the issue, and only have an effect on performance if there is a size limit on the cache.

          Please comment if this fixes the problem you're running into, and if so we can close the issue... and if not we'll have to look into it more!

          Show
          David E. Jones added a comment - Philippe, Please understand that what you're saying now is very different than what you said before. I wasn't questioning your conclusion, I was just responding to your statement that you opinion was that the non-synchronized get was the problem: "IMHO the bug is a missing synchonized in get()". I totally agree that if there is code that changes things it needs to be synchronized, and we should look into that more, and how to isolate it so that it is done only when the LRU list is used (which is not the common case, BTW). Thank you for providing more detail and being more clear in your additional comments. This one was actually pretty easy, so I made some changes in SVN rev 744205 that should resolve the issue, and only have an effect on performance if there is a size limit on the cache. Please comment if this fixes the problem you're running into, and if so we can close the issue... and if not we'll have to look into it more!
          Hide
          Philippe Mouawad added a comment -

          Mimics the bug.
          LRUMap is Ofbiz class
          CacheLineUsage uses the LRUMap the way CacheLineTable uses it, no synchro on get
          TestLRU run some threads
          Run this with -Xmx set to 5Mo

          Show
          Philippe Mouawad added a comment - Mimics the bug. LRUMap is Ofbiz class CacheLineUsage uses the LRUMap the way CacheLineTable uses it, no synchro on get TestLRU run some threads Run this with -Xmx set to 5Mo
          Hide
          Philippe Mouawad added a comment -

          Hello M. Jones,
          If you don't think that my arguments are correct, can you explain why ?
          I agree with you about the synchronization performance loss but is it acceptable
          that this absence of synchronization provokes a server crash through OOM ?
          I will propose another solution soon based on another implementation of LRU but
          meanwhile we will put this in production and come back with results.

          Now about my explanation, since a LRU is used, LRU is computed based on get(),
          look at the code of LinkedHashMap and you will see that get() modifies the state
          of the Map, if so Synchronization is mandatory.

          Tell me what kind of info you need to be convinced.

          To be more convinced run the program I attach with the following options on a JDK 5:
          -Xmx5m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:/temp/

          It will throw an OutOfMemory.
          Add synchronized to CacheLineUsage#get() and see NO MORE PROBLEM.

          Inspect the Heap Dump and you get exactly what I told you:
          size of Map is 31362 while maxSize is 5000
          Memory occupation is 6056872

          Show
          Philippe Mouawad added a comment - Hello M. Jones, If you don't think that my arguments are correct, can you explain why ? I agree with you about the synchronization performance loss but is it acceptable that this absence of synchronization provokes a server crash through OOM ? I will propose another solution soon based on another implementation of LRU but meanwhile we will put this in production and come back with results. Now about my explanation, since a LRU is used, LRU is computed based on get(), look at the code of LinkedHashMap and you will see that get() modifies the state of the Map, if so Synchronization is mandatory. Tell me what kind of info you need to be convinced. To be more convinced run the program I attach with the following options on a JDK 5: -Xmx5m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:/temp/ It will throw an OutOfMemory. Add synchronized to CacheLineUsage#get() and see NO MORE PROBLEM. Inspect the Heap Dump and you get exactly what I told you: size of Map is 31362 while maxSize is 5000 Memory occupation is 6056872
          Hide
          David E. Jones added a comment -

          It sounds like it is in your opinion that the problem is the missing synchronization.

          IMO, that would be really surprising...

          Before considering a patch we really need to know what is causing the problem and come up with a fix.

          Adding a blanket synchronization will slow things down a LOT, so we should do whatever is needed to find an alternative. Still, it doesn't matter because unless we know this fixes the problem we should not consider the change.

          Show
          David E. Jones added a comment - It sounds like it is in your opinion that the problem is the missing synchronization. IMO, that would be really surprising... Before considering a patch we really need to know what is causing the problem and come up with a fix. Adding a blanket synchronization will slow things down a LOT, so we should do whatever is needed to find an alternative. Still, it doesn't matter because unless we know this fixes the problem we should not consider the change.
          Hide
          Philippe Mouawad added a comment -

          Addition of synchronized to get()

          Show
          Philippe Mouawad added a comment - Addition of synchronized to get()

            People

            • Assignee:
              Jacques Le Roux
              Reporter:
              Philippe Mouawad
            • Votes:
              4 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development