Derby
  1. Derby
  2. DERBY-4711

Hung thread after another thread is interrupted

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.5.3.0
    • Fix Version/s: 10.5.3.2, 10.6.2.2, 10.7.1.1
    • Component/s: Services
    • Labels:
      None
    • Environment:
      Windows server 2008
    • Urgency:
      Normal
    • Bug behavior facts:
      Seen in production

      Description

      We've seen a problem today where we have several threads querying our database and when one gets interrupted the others are stuck waiting for a lock. Here is the stack trace for the stuck thread(s):

      daemon prio=4
      "DefaultExecutorService-pool-1-thread-47" Id=98 WAITING on org.apache.derby.impl.services.locks.ActiveLock@6e6f45a1
      at java.lang.Object.wait(Native Method)

      • waiting on org.apache.derby.impl.services.locks.ActiveLock@6e6f45a1
        at java.lang.Object.wait(Object.java:485)
        at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(ActiveLock.java:115)
        at org.apache.derby.impl.services.locks.ConcurrentLockSet.lockObject(ConcurrentLockSet.java:463)
        at org.apache.derby.impl.services.locks.ConcurrentLockSet.zeroDurationLockObject(ConcurrentLockSet.java:855)
        at org.apache.derby.impl.services.locks.AbstractPool.zeroDurationlockObject(AbstractPool.java:297)
        at org.apache.derby.impl.services.locks.ConcurrentPool.zeroDurationlockObject(ConcurrentPool.java:28)
        at org.apache.derby.impl.store.raw.xact.RowLocking2nohold.lockRecordForRead(RowLocking2nohold.java:89)
        at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:520)
        at org.apache.derby.impl.store.access.heap.HeapController.lockRow(HeapController.java:638)
        at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3.lockRowOnPage(B2IRowLocking3.java:309)
        at org.apache.derby.impl.store.access.btree.index.B2IRowLocking3._lockScanRow(B2IRowLocking3.java:599)
        at org.apache.derby.impl.store.access.btree.index.B2IRowLockingRR.lockScanRow(B2IRowLockingRR.java:105)
        at org.apache.derby.impl.store.access.btree.BTreeForwardScan.fetchRows(BTreeForwardScan.java:305)
        at org.apache.derby.impl.store.access.btree.BTreeScan.fetchNextGroup(BTreeScan.java:1585)
        at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.reloadArray(BulkTableScanResultSet.java:327)
        at org.apache.derby.impl.sql.execute.BulkTableScanResultSet.getNextRowCore(BulkTableScanResultSet.java:282)
        at org.apache.derby.impl.sql.execute.BasicNoPutResultSetImpl.getNextRow(BasicNoPutResultSetImpl.java:460)
        at org.apache.derby.impl.jdbc.EmbedResultSet.movePosition(EmbedResultSet.java:427)
      • locked org.apache.derby.impl.jdbc.EmbedConnection40@445d374b
        at org.apache.derby.impl.jdbc.EmbedResultSet.next(EmbedResultSet.java:371)
        ...
      1. interrupted-exception-fix.diff
        2 kB
        Luke Quinane
      2. interrupted-exception-fix.patch
        2 kB
        Luke Quinane
      3. junit.diff
        8 kB
        Knut Anders Hatlen
      4. junit-timing-fix.diff
        0.8 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Luke Quinane added a comment -

          My original patch.

          Show
          Luke Quinane added a comment - My original patch.
          Hide
          Luke Quinane added a comment -

          Kristian's updated version of the patch.

          Show
          Luke Quinane added a comment - Kristian's updated version of the patch.
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Luke! The fix looks correct to me. I'll run the regression test suites with the patch and I intend to commit the patch if the tests pass. I also have a half-finished JUnit test that exposes this bug in my sandbox. I'll finish it and attach it here.

          Show
          Knut Anders Hatlen added a comment - Thanks, Luke! The fix looks correct to me. I'll run the regression test suites with the patch and I intend to commit the patch if the tests pass. I also have a half-finished JUnit test that exposes this bug in my sandbox. I'll finish it and attach it here.
          Hide
          Knut Anders Hatlen added a comment -

          Luke, I've added you to the derby-developers group in JIRA so that you can assign issues to yourself. I assigned this issue to you.

          Show
          Knut Anders Hatlen added a comment - Luke, I've added you to the derby-developers group in JIRA so that you can assign issues to yourself. I assigned this issue to you.
          Hide
          Knut Anders Hatlen added a comment -

          Committed Luke's patch to trunk with revision 957902.

          Show
          Knut Anders Hatlen added a comment - Committed Luke's patch to trunk with revision 957902.
          Hide
          Knut Anders Hatlen added a comment -

          Here's a regression test case for the bug. Without the fix, the test completes in about 35 seconds and fails with this error:

          AssertionFailedError: Second thread needed 30008 ms to complete. Probably stuck waiting for a lock.

          With the fix, the test completes with no errors in about 5 seconds.

          Show
          Knut Anders Hatlen added a comment - Here's a regression test case for the bug. Without the fix, the test completes in about 35 seconds and fails with this error: AssertionFailedError: Second thread needed 30008 ms to complete. Probably stuck waiting for a lock. With the fix, the test completes with no errors in about 5 seconds.
          Hide
          Mike Matrigali added a comment -

          any chance to get comments in the code or in this issue about the cause of the problem and why this change fixes it?

          Show
          Mike Matrigali added a comment - any chance to get comments in the code or in this issue about the cause of the problem and why this change fixes it?
          Hide
          Knut Anders Hatlen added a comment -

          Hi Mike,

          I've added this comment along with the fix in LockSet and ConcurrentLockSet:

          + // DERBY-4711: If waitForGrant() fails, we need to
          + // remove ourselves from the queue so that those
          + // behind us in the queue don't get stuck waiting for
          + // us.

          The problem is that (Concurrent)LockSet.lockObject() only removes the waiter from the queue if the lock is obtained or if a timeout/deadlock exception is generated (using similar code further down in said method). When waitForGrant() fails (in this case with a StandardException wrapping an InterruptedException), it is not removed from the queue, and other threads are blocked by a lock request that's no longer alive. The fix removes the failed waiter from the queue, and allows the other threads to proceed.

          Hope this made things clearer.

          Show
          Knut Anders Hatlen added a comment - Hi Mike, I've added this comment along with the fix in LockSet and ConcurrentLockSet: + // DERBY-4711 : If waitForGrant() fails, we need to + // remove ourselves from the queue so that those + // behind us in the queue don't get stuck waiting for + // us. The problem is that (Concurrent)LockSet.lockObject() only removes the waiter from the queue if the lock is obtained or if a timeout/deadlock exception is generated (using similar code further down in said method). When waitForGrant() fails (in this case with a StandardException wrapping an InterruptedException), it is not removed from the queue, and other threads are blocked by a lock request that's no longer alive. The fix removes the failed waiter from the queue, and allows the other threads to proceed. Hope this made things clearer.
          Hide
          Knut Anders Hatlen added a comment -

          I've committed the JUnit test case to trunk with revision 958264.

          Show
          Knut Anders Hatlen added a comment - I've committed the JUnit test case to trunk with revision 958264.
          Hide
          Knut Anders Hatlen added a comment -

          A problem with the new test was reported on derby-dev: http://mail-archives.apache.org/mod_mbox/db-derby-dev/201006.mbox/%3C4C27C93C.2020403@gmail.com%3E

          The problem appears to be that the commit in the main thread may allow the waiting thread to continue before it has received the interrupt, and thus it doesn't fail as it is supposed to. Adding a sleep between t1.interrupt() and commit() gives the interrupt time to propagate to the waiter before the main thread commits and releases its locks. See the attached patch junit-timing-fix.diff.

          The test failed on almost every run in my environment when using insane jars. After patching the test, I've run it 200 times without seeing the failure.

          Show
          Knut Anders Hatlen added a comment - A problem with the new test was reported on derby-dev: http://mail-archives.apache.org/mod_mbox/db-derby-dev/201006.mbox/%3C4C27C93C.2020403@gmail.com%3E The problem appears to be that the commit in the main thread may allow the waiting thread to continue before it has received the interrupt, and thus it doesn't fail as it is supposed to. Adding a sleep between t1.interrupt() and commit() gives the interrupt time to propagate to the waiter before the main thread commits and releases its locks. See the attached patch junit-timing-fix.diff. The test failed on almost every run in my environment when using insane jars. After patching the test, I've run it 200 times without seeing the failure.
          Hide
          Knut Anders Hatlen added a comment -

          Committed junit-timing-fix.diff to trunk with revision 958508.

          Show
          Knut Anders Hatlen added a comment - Committed junit-timing-fix.diff to trunk with revision 958508.
          Hide
          Kristian Waagan added a comment -

          Reopening for backport to 10.6. One of the changes in the test harness is needed for another backport, and this fix seems nice to have in older branches too.

          Show
          Kristian Waagan added a comment - Reopening for backport to 10.6. One of the changes in the test harness is needed for another backport, and this fix seems nice to have in older branches too.
          Hide
          Kristian Waagan added a comment -

          Backported to 10.6 with revision 1040551.

          Show
          Kristian Waagan added a comment - Backported to 10.6 with revision 1040551.
          Hide
          Kathey Marsden added a comment -

          Reopen for backport

          Show
          Kathey Marsden added a comment - Reopen for backport
          Hide
          Mike Matrigali added a comment -

          temp assigning to myself while backporting to 10.5

          Show
          Mike Matrigali added a comment - temp assigning to myself while backporting to 10.5
          Hide
          Mike Matrigali added a comment -

          backported to 10.5. resolving and resetting original owner.

          Show
          Mike Matrigali added a comment - backported to 10.5. resolving and resetting original owner.
          Hide
          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.

          Show
          Knut Anders Hatlen added a comment - [bulk update] Close all resolved issues that haven't been updated for more than one year.

            People

            • Assignee:
              Luke Quinane
              Reporter:
              Luke Quinane
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development