DERBY-5185

store/rollForwardRecovery.sql stuck in RAFContainer4.recoverContainerAfterInterrupt() during shutdown

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.9.1.0
    • Fix Version/s: 10.8.1.2, 10.9.1.0
    • Component/s: Store
    • Labels: None

      Description

      I have a derbyall run that has been going for more than two days now. It seems to be stuck in the store/rollForwardRecovery.sql test while the engine is shutting down.

      Here's the stack trace for the daemon thread that's stuck:

      "derby.rawStoreDaemon" daemon prio=10 tid=0xf3e7dc00 nid=0x3505 waiting on condition [0xf4066000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.recoverContainerAfterInterrupt(Unknown Source)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.readPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.CachedPage.readPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.CachedPage.setIdentity(Unknown Source)
      at org.apache.derby.impl.services.cache.ConcurrentCache.find(Unknown Source)
      at org.apache.derby.impl.store.raw.data.FileContainer.getAllocPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseContainer.getAllocPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseContainerHandle.getAllocPage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.FileContainer.deallocatePagenum(Unknown Source)
      - locked <0xc5adbce8> (a org.apache.derby.impl.store.raw.data.AllocationCache)
      at org.apache.derby.impl.store.raw.data.FileContainer.deallocatePage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseContainer.removePage(Unknown Source)
      at org.apache.derby.impl.store.raw.data.BaseContainerHandle.removePage(Unknown Source)
      at org.apache.derby.impl.store.access.heap.HeapController.removePage(Unknown Source)
      at org.apache.derby.impl.store.access.heap.HeapPostCommit.purgeCommittedDeletes(Unknown Source)
      at org.apache.derby.impl.store.access.heap.HeapPostCommit.performWork(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.serviceClient(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.work(Unknown Source)
      at org.apache.derby.impl.services.daemon.BasicDaemon.run(Unknown Source)
      at java.lang.Thread.run(Thread.java:722)

      And here's the stack trace for the main thread, which is waiting for the daemon thread to stop:

      "main" prio=10 tid=0xf6c05c00 nid=0x34e5 in Object.wait() [0xf6dbe000]
      java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      - waiting on <0xc5ac5760> (a org.apache.derby.impl.services.daemon.BasicDaemon)
      at java.lang.Object.wait(Object.java:504)
      at org.apache.derby.impl.services.daemon.BasicDaemon.pause(Unknown Source)
      - locked <0xc5ac5760> (a org.apache.derby.impl.services.daemon.BasicDaemon)
      at org.apache.derby.impl.services.daemon.BasicDaemon.stop(Unknown Source)
      at org.apache.derby.impl.store.raw.RawStore.stop(Unknown Source)
      at org.apache.derby.impl.services.monitor.TopService.stop(Unknown Source)
      at org.apache.derby.impl.services.monitor.TopService.shutdown(Unknown Source)
      at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(Unknown Source)
      at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(Unknown Source)
      at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
      at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
      at java.sql.DriverManager.getConnection(DriverManager.java:620)
      at java.sql.DriverManager.getConnection(DriverManager.java:222)
      at org.apache.derby.impl.tools.ij.utilMain.cleanupGo(Unknown Source)
      at org.apache.derby.impl.tools.ij.utilMain.go(Unknown Source)
      at org.apache.derby.impl.tools.ij.Main.go(Unknown Source)
      at org.apache.derby.impl.tools.ij.Main.mainCore(Unknown Source)
      at org.apache.derby.impl.tools.ij.Main.main(Unknown Source)
      at org.apache.derby.tools.ij.main(Unknown Source)
      Attachments

      1. derby-5185-1a.diff (1 kB) - Dag H. Wanvik
      2. derby-5185-2a.diff (3 kB) - Dag H. Wanvik
      3. derby-5185-2a.stat (0.1 kB) - Dag H. Wanvik
      4. derby-5185-2b.diff (3 kB) - Dag H. Wanvik
      5. derby-5185-3a.diff (2 kB) - Dag H. Wanvik
      6. stack.txt (5 kB) - Knut Anders Hatlen

          Activity

          Knut Anders Hatlen added a comment -

          Uploading a stack dump for the process that hangs.

          Dag H. Wanvik added a comment -

          The hang occurs in this piece of looping code (cf the sleep)

           // Wait till other concurrent threads hit the wall
           // (ClosedChannelException) and are a ready wait for us to clean up, so
           // we can set them loose when we're done.
           while (true) {
               synchronized (channelCleanupMonitor) {
                   if (threadsInPageIO == 0) {
                       // Either no concurrent threads, or they are now waiting
                       // for us to clean up (see ClosedChannelException case)
                       break;
                   }
               }

               try {
                   Thread.sleep(10);
               } catch (InterruptedException te) {
                   InterruptStatus.setInterrupted();
               }
           }

           Looking at the running VM, we can see that only two Derby threads are still live, and only the looping one is in the NIO code.
           This means the counter will never reach zero. Looking further at the live VM's heap, there are several containers whose counters are positive, which they are not supposed to be, since we have only this one thread inside the NIO code. The container whose CacheEntry#keepCount > 0 also has threadsInPageIO > 0. This is probably the one that sees the hang during HeapPostCommit#purgeCommittedDeletes.

           This is not good, since it happens during shutdown without any interrupts being issued by the application code (Derby uses interrupts during shutdown to stop threads). So this is a regression: earlier, this would only result in a (masked) IO error during shutdown, whereas now we see a hang.

           I will a) commit a patch to make the loop bounded so we can get the regression out of the way, then b) look for the reason why the counter goes wrong.

          Dag H. Wanvik added a comment -

          Uploading a patch which makes the loop in question bounded (maxes out at 60s): derby-5185-1a, running regressions.
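           For illustration only, here is a rough sketch of what bounding that wait could look like. The helper structure, the 60 second budget split into 10 ms naps, and the use of FILE_IO_INTERRUPTED are assumptions drawn from the discussion in this issue, not the actual derby-5185-1a diff:

           // Sketch (not the actual patch): bound the recovery wait so shutdown
           // cannot hang forever. threadsInPageIO and channelCleanupMonitor are
           // the existing RAFContainer4 fields referred to above.
           private void awaitQuiescentPageIO() throws StandardException {
               int slept = 0;                          // total time slept so far, in ms

               while (true) {
                   synchronized (channelCleanupMonitor) {
                       if (threadsInPageIO == 0) {
                           return;                     // safe to recover the channel now
                       }
                   }

                   if (slept >= 60000) {
                       // Give up instead of hanging; surface an error as elsewhere
                       // in the interrupt handling code (assumed to map to the
                       // FILE_IO_INTERRUPTED state mentioned in this issue).
                       throw StandardException.newException(SQLState.FILE_IO_INTERRUPTED);
                   }

                   try {
                       Thread.sleep(10);
                       slept += 10;
                   } catch (InterruptedException te) {
                       InterruptStatus.setInterrupted();
                   }
               }
           }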

          Dag H. Wanvik added a comment -

          Regressions passed.

          Dag H. Wanvik added a comment -

          Committed to trunk as svn 1091221.

          Dag H. Wanvik added a comment -

          Backported to 10.8 as svn 1091258.

          Knut Anders Hatlen added a comment -

           I saw a new hang in rollForwardRecovery.sql after the first fix went in. It was on the same machine, but this time with OpenJDK 6, and the hang had now moved to a different location:

          "main" prio=10 tid=0x00007f9f0c009000 nid=0x571 in Object.wait() [0x00007f9f123eb000]
          java.lang.Thread.State: WAITING (on object monitor)
           at java.lang.Object.wait(Native Method)
           - waiting on <0x00007f9ee3eccf58> (a java.lang.Object)
           at java.lang.Object.wait(Object.java:502)
           at org.apache.derby.impl.store.raw.data.RAFContainer4.writePage(Unknown Source)
           - locked <0x00007f9ee3eccf58> (a java.lang.Object)
           at org.apache.derby.impl.store.raw.data.CachedPage.writePage(Unknown Source)
           at org.apache.derby.impl.store.raw.data.CachedPage.clean(Unknown Source)
           at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(Unknown Source)
           at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(Unknown Source)
           at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(Unknown Source)
           at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(Unknown Source)
           at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(Unknown Source)
           at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(Unknown Source)
           at org.apache.derby.impl.store.raw.RawStore.stop(Unknown Source)
           at org.apache.derby.impl.services.monitor.TopService.stop(Unknown Source)
           at org.apache.derby.impl.services.monitor.TopService.shutdown(Unknown Source)
           at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(Unknown Source)
           at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(Unknown Source)
           at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
           at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
           at java.sql.DriverManager.getConnection(DriverManager.java:620)
           at java.sql.DriverManager.getConnection(DriverManager.java:222)
           at org.apache.derby.impl.tools.ij.utilMain.cleanupGo(Unknown Source)
           at org.apache.derby.impl.tools.ij.utilMain.go(Unknown Source)
           at org.apache.derby.impl.tools.ij.Main.go(Unknown Source)
           at org.apache.derby.impl.tools.ij.Main.mainCore(Unknown Source)
           at org.apache.derby.impl.tools.ij.Main.main(Unknown Source)
           at org.apache.derby.tools.ij.main(Unknown Source)
          Knut Anders Hatlen added a comment -

          I filed DERBY-5190 for the new hang.

          Dag H. Wanvik added a comment -

           Scenario: A thread has seen an interrupt and is getting ready to resurrect the channel in "recoverContainerAfterInterrupt". Before it starts doing this, however, it waits for any other threads currently doing IO on this container to "hit the wall" and start waiting for the current thread to do the recovery. The current thread knows it's free to proceed when the counter "threadsInPageIO" reaches 0. In this case, it has given up waiting for the counter to reach 0 and throws FILE_IO_INTERRUPTED. (I believe "threadsInPageIO" should have been 0 here but is not for some reason, see below.) In throwing, it neglects to reset the state variable "restoreChannelInProgress", which makes the next thread coming along, the one we see hanging in this issue, get stuck when trying to enter the IO code in the "gain entry" section.

           Attaching a patch, derby-5185-2a, which fixes the state invariant maintenance when throwing FILE_IO_INTERRUPTED. It also adds a maximum number of retries to the readPage code and fixes some cases where the state variable "threadsInPageIO" could risk not being properly updated when exceptions were thrown. This may be the underlying reason for what we see here.
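           As an illustration of the try/finally idea (a sketch only, not the actual derby-5185-2a diff; readPage0 is an assumed name for the underlying I/O call, and the notifyAll on the monitor is likewise an assumption):

           // Sketch: keep threadsInPageIO balanced even when the I/O call throws.
           synchronized (channelCleanupMonitor) {
               threadsInPageIO++;
           }
           try {
               readPage0(pageNumber, pageData, offset);  // assumed name for the real I/O work
           } finally {
               synchronized (channelCleanupMonitor) {
                   threadsInPageIO--;
               }
           }

           // Sketch: in the recovery path, always clear restoreChannelInProgress
           // before giving up with FILE_IO_INTERRUPTED, so the next thread trying
           // to "gain entry" is not left waiting forever.
           try {
               // ... wait (bounded) for threadsInPageIO to reach 0 and recover the channel ...
           } finally {
               synchronized (channelCleanupMonitor) {
                   restoreChannelInProgress = false;
                   channelCleanupMonitor.notifyAll();
               }
           }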

          Ran regressions OK on Solaris/JDK6 and Debian/JDK7.

          Knut Anders Hatlen added a comment -

          Hi Dag,

          Using try/finally to maintain threadsInPageIO sounds like a good approach. The patch looks good to me. +1 to commit.

          Dag H. Wanvik added a comment -

           Uploaded and committed variant 2b of the patch (2a had a line missing) as svn 1094572.

          Dag H. Wanvik added a comment -

          Backported this fix to the 10.8 branch as svn 1094585.

          Dag H. Wanvik added a comment -

           Since we don't have a lot of time to gain confidence in the code before the impending release, I'd like to avoid the possibility of further hangs in the interrupt handling by committing a follow-up patch which adds max waiting time logic to the two remaining loops where threads wait for a condition to become true: the "gain entry" logic in readPage and writePage. It should not be necessary now that we have fixed the state maintenance code, but just to be safe. In normal operation it adds essentially no overhead (just one extra counter comparison), so it seems a cheap price to pay to avoid the prospect of a hang during shutdown, which would constitute a regression. If we max out, we will throw FILE_IO_INTERRUPTED as elsewhere. Uploading derby-5185-3a for this.
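           For illustration, a rough sketch of such a bounded "gain entry" wait (not the actual derby-5185-3a diff; the 100 ms wait slice, the 60 s cap, and the exact structure are assumptions):

           // Sketch: a thread about to do page I/O waits for any in-progress channel
           // recovery to finish, but only for a bounded amount of time.
           synchronized (channelCleanupMonitor) {
               long waited = 0;                            // approximate ms spent waiting

               while (restoreChannelInProgress) {
                   if (waited >= 60000) {
                       // Don't hang forever if the recovering thread never lets us in.
                       throw StandardException.newException(SQLState.FILE_IO_INTERRUPTED);
                   }
                   try {
                       channelCleanupMonitor.wait(100);    // woken by the recovering thread
                       waited += 100;
                   } catch (InterruptedException e) {
                       InterruptStatus.setInterrupted();
                   }
               }

               threadsInPageIO++;                          // now counted as doing page I/O
           }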

          Dag H. Wanvik added a comment -

          Regressions ran with the 3a patch:

          • OK on Solaris x86/JDK6 and
          • with one error which I believe is unrelated, DERBY-5197, on Debian/JDK7,

          both with insane jars.

          Dag H. Wanvik added a comment -

           Committed as svn 1094728 on trunk and svn 1094730 on the 10.8 branch.

          Knut Anders Hatlen added a comment -

          Thanks, Dag. The 3a patch looks like a good precaution to prevent hangs in case there are other similar problems. I've run suites.All and derbyall multiple times on the platforms on which I saw the hangs, and I haven't seen any problems so far.


            People

             • Assignee: Dag H. Wanvik
             • Reporter: Knut Anders Hatlen
             • Votes: 0
             • Watchers: 0
