Details

    • Urgency:
      Normal
    • Issue & fix info:
      Known fix, Workaround attached
    • Bug behavior facts:
      Crash

      Description

      When using JTA for transaction control and a transaction timeout is set,
      EmbedXAResource ends up calling XATransactionState.scheduleTimeoutTask() which
      in turn registers a timeoutTask with java.util.Timer. In the normal case where
      the transaction finishes before the timeout, XATransactionState.xa_finalize()
      then calls timeoutTask.cancel(). So far this so good. The problem, however, is
      that java.util.TimerTask.cancel() does not actually remove the task from the
      timer queue, meaning that a strong reference to the timeoutTask is kept (and
      through that to XATransactionState, the EmbedConnection, etc). The reference
      is not removed until the time at which the timeout would have fired, which can
      be a long time. Under load this can quickly lead to an OOM situation.

      A simple fix is to call Timer.purge() every so often. While the javadocs talk
      about purge() being rarely needed and that it's not extremely cheap, I've
      found that calling it after every cancel() is the best approach, for several
      reasons: 1) the scenario here is that almost all tasks are cancelled, and
      hence this somewhat fits the Timer.purge() description of an "application that
      cancels a large number of tasks"; 2) there usually isn't a very large number
      of simultaneous transactions, and hence purge() is actually quite cheap; 3)
      this ensures the strong reference is immediately removed, allowing the GC to
      do a better job. Interestingly enough, I've had this exact same issue on a
      different type of db, and I had tested the purge() there and found it to be in
      the sub-microsecond range for 100 transactions (or similar - I don't recall
      the exact data), i.e. completely negligible.

      In short, my suggestion is to change xa_finalize as follows:

      synchronized void xa_finalize() {
      if (timeoutTask != null)

      { timeoutTask.cancel(); Monitor.getMonitor().getTimerFactory(). getCancellationTimer().purge(); }

      isFinished = true;
      }

      As a temporary workaround, applications can do this themselves, i.e.
      add something like the following whenever they close a Connection:

      import org.apache.derby.iapi.services.monitor.Monitor;
      Monitor.getMonitor().getTimerFactory().getCancellationTimer().purge();

      1. derby-4137-2a-reduce_memory_footprint.diff
        7 kB
        Kristian Waagan
      2. derby-4137-1b-purge_on_cancel.diff
        8 kB
        Kristian Waagan
      3. derby-4137-1a-purge_on_cancel.diff
        7 kB
        Kristian Waagan
      4. derby-4137-1a-purge_on_cancel.stat
        0.2 kB
        Kristian Waagan

        Issue Links

          Activity

          Hide
          Kristian Waagan added a comment -

          Merged fixes back to 10.5:
          10.8: r1142956 (with manual conflict resolution in memoy/_Suite class)
          10.7: r1142957
          10.6: r1142958
          10.5: r1142960

          Got merge conflicts on 10.4.

          Show
          Kristian Waagan added a comment - Merged fixes back to 10.5: 10.8: r1142956 (with manual conflict resolution in memoy/_Suite class) 10.7: r1142957 10.6: r1142958 10.5: r1142960 Got merge conflicts on 10.4.
          Hide
          Knut Anders Hatlen added a comment -

          When backporting this issue, include the change in DERBY-5291 to disable the test on Java ME.

          Show
          Knut Anders Hatlen added a comment - When backporting this issue, include the change in DERBY-5291 to disable the test on Java ME.
          Hide
          Kristian Waagan added a comment -

          Committed patch 2a to trunk with revision 1136363.
          Awaiting test results from nightlies before backporting.

          Show
          Kristian Waagan added a comment - Committed patch 2a to trunk with revision 1136363. Awaiting test results from nightlies before backporting.
          Hide
          Kristian Waagan added a comment -

          Attaching patch 2a, which reduces the footprint of the cancellation task object.

          I made the cancellation task class static, and made it accept a reference to the XAState object. When the task is canceled, which happens when the XA transaction succeeds within the timeout value, the reference to the XAState object is cleared.

          As mentioned earlier, the memory usage when using XA timeouts is a function of the transaction rate and the timeout value. Setting the timeout value very high is not a good idea (i.e. several hours or days).

          If this fix is insufficient, we can as already mentioned add the purge functionality.

          suites.All and 'ant junit-lowmem' passed.
          Patch ready for review.

          Show
          Kristian Waagan added a comment - Attaching patch 2a, which reduces the footprint of the cancellation task object. I made the cancellation task class static, and made it accept a reference to the XAState object. When the task is canceled, which happens when the XA transaction succeeds within the timeout value, the reference to the XAState object is cleared. As mentioned earlier, the memory usage when using XA timeouts is a function of the transaction rate and the timeout value. Setting the timeout value very high is not a good idea (i.e. several hours or days). If this fix is insufficient, we can as already mentioned add the purge functionality. suites.All and 'ant junit-lowmem' passed. Patch ready for review.
          Hide
          Knut Anders Hatlen added a comment -

          So it sounds like clearing the reference should at least reduce the problem considerably. It's worth trying out. If it turns out not to be enough in all scenarios, we can always add the call to purge() back. Given that the timer queue isn't very long, it might be an acceptable cost to do the purging every time an XA transaction is committed/aborted when a tx timeout is enabled.

          Show
          Knut Anders Hatlen added a comment - So it sounds like clearing the reference should at least reduce the problem considerably. It's worth trying out. If it turns out not to be enough in all scenarios, we can always add the call to purge() back. Given that the timer queue isn't very long, it might be an acceptable cost to do the purging every time an XA transaction is committed/aborted when a tx timeout is enabled.
          Hide
          Kristian Waagan added a comment -

          > Does the OOME happen because there are too many CancelXATransactionTask objects, or because the CancelXATransactionTask objects reference too many other objects via the XATransactionState?

          Both, but the primary factor is the objects referenced via XATransactionState.

          • Current (top three at OOME):
            class [B 91172 7859790
            class org.apache.derby.jdbc.XATransactionState 43581 1830402
            class org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask 43581 1220268
          • Nullifying reference to XATransactionState (top three at OOME)
            class org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask 261134 7311752
            class [B 4010 1584124
            class [Ljava.util.TimerTask; 1 1048584

          The memory requirements are decided by the transaction rate (# of objects) and the timeout value (lifetime of objects).
          CancelXATransactionTask seems to occupy 28 bytes, which is rather small. I have no idea what kind of transaction rates and timeout values people are using, but maybe the alternative solution you are proposing will be good enough for most use cases?

          I'll write a patch for it.

          Show
          Kristian Waagan added a comment - > Does the OOME happen because there are too many CancelXATransactionTask objects, or because the CancelXATransactionTask objects reference too many other objects via the XATransactionState? Both, but the primary factor is the objects referenced via XATransactionState. Current (top three at OOME): class [B 91172 7859790 class org.apache.derby.jdbc.XATransactionState 43581 1830402 class org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask 43581 1220268 Nullifying reference to XATransactionState (top three at OOME) class org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask 261134 7311752 class [B 4010 1584124 class [Ljava.util.TimerTask; 1 1048584 The memory requirements are decided by the transaction rate (# of objects) and the timeout value (lifetime of objects). CancelXATransactionTask seems to occupy 28 bytes, which is rather small. I have no idea what kind of transaction rates and timeout values people are using, but maybe the alternative solution you are proposing will be good enough for most use cases? I'll write a patch for it.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks. The patch makes more sense to me now.

          Does the OOME happen because there are too many CancelXATransactionTask objects, or because the CancelXATransactionTask objects reference too many other objects via the XATransactionState? If the latter, would it be sufficient to clear the reference to the XATransactionState object when calling cancel()? I see that CancelQueryTask in GenericStatementContext uses that technique. If possible, it would be good to avoid calling the purge() method on every XA commit/rollback, both because of the complexity caused by reflection and because the javadoc warns about its performance impact.

          Show
          Knut Anders Hatlen added a comment - Thanks. The patch makes more sense to me now. Does the OOME happen because there are too many CancelXATransactionTask objects, or because the CancelXATransactionTask objects reference too many other objects via the XATransactionState? If the latter, would it be sufficient to clear the reference to the XATransactionState object when calling cancel()? I see that CancelQueryTask in GenericStatementContext uses that technique. If possible, it would be good to avoid calling the purge() method on every XA commit/rollback, both because of the complexity caused by reflection and because the javadoc warns about its performance impact.
          Hide
          Kristian Waagan added a comment -

          Thanks for catching that blunder, Knut!

          Attaching a revised patch 1b. I additionally fixed a typo, and increased and commented on the timeout value used in the test. The test didn't fail even though timeoutTask was always null because of the lowish timeout (10 seconds, the timer itself removed the tasks before an OOME occurred).

          Rerunning regression tests.
          Revised patch ready for review.

          Show
          Kristian Waagan added a comment - Thanks for catching that blunder, Knut! Attaching a revised patch 1b. I additionally fixed a typo, and increased and commented on the timeout value used in the test. The test didn't fail even though timeoutTask was always null because of the lowish timeout (10 seconds, the timer itself removed the tasks before an OOME occurred). Rerunning regression tests. Revised patch ready for review.
          Hide
          Knut Anders Hatlen added a comment -

          The patch removes the initialization of the timeoutTask field in CancelXATransactionTask's constructor, but I can't see that it's initialized anywhere else. Does it mean it's always null now?

          Show
          Knut Anders Hatlen added a comment - The patch removes the initialization of the timeoutTask field in CancelXATransactionTask's constructor, but I can't see that it's initialized anywhere else. Does it mean it's always null now?
          Hide
          Kristian Waagan added a comment -

          Created DERBY-5264 to track the fact that there is no planned fix for this in Derby when running with Java 1.4.

          Show
          Kristian Waagan added a comment - Created DERBY-5264 to track the fact that there is no planned fix for this in Derby when running with Java 1.4.
          Hide
          Kristian Waagan added a comment -

          Attaching patch 1a, which implements the fix suggested by Ronald. I also added a test to the lowmem suite.
          The fix will only be available when running with Java 1.5 or newer, so Java 1.4 will still have the bug.

          Patch ready for review.

          Show
          Kristian Waagan added a comment - Attaching patch 1a, which implements the fix suggested by Ronald. I also added a test to the lowmem suite. The fix will only be available when running with Java 1.5 or newer, so Java 1.4 will still have the bug. Patch ready for review.
          Hide
          Kristian Waagan added a comment -

          Note that the proposed fix uses a method introduced in Java SE 5.0. It'll be a while before we can drop J2SE 1.4 support...
          We can still make the fix work when using Java SE 5.0, and the workaround can also be used where appropriate.

          Show
          Kristian Waagan added a comment - Note that the proposed fix uses a method introduced in Java SE 5.0. It'll be a while before we can drop J2SE 1.4 support... We can still make the fix work when using Java SE 5.0, and the workaround can also be used where appropriate.
          Hide
          Rick Hillegas added a comment -

          Triaged July 2, 2009: Marked as Crash and assigned normal urgency.

          Show
          Rick Hillegas added a comment - Triaged July 2, 2009: Marked as Crash and assigned normal urgency.
          Hide
          Ronald Tschalaer added a comment -

          I just noticed that GenericStatementContext has potentially the same issue (I haven't delved
          deep enough to understand the lifecyle of instances of that class, so it may or may not be an
          issue there).

          Show
          Ronald Tschalaer added a comment - I just noticed that GenericStatementContext has potentially the same issue (I haven't delved deep enough to understand the lifecyle of instances of that class, so it may or may not be an issue there).

            People

            • Assignee:
              Kristian Waagan
              Reporter:
              Ronald Tschalaer
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development