Derby

DERBY-4920: suites.All stuck in RAFContainer4.awaitRestoreChannel()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.8.1.2
    • Fix Version/s: 10.8.1.2
    • Component/s: Store
    • Labels:
      None

      Description

      I have suites.All running on trunk, and it seems to be stuck in a call to RAFContainer4.awaitRestoreChannel(). Here's the stack for the waiting thread:

      "main" prio=3 tid=0x08070800 nid=0x2 waiting on condition [0xfe61e000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.awaitRestoreChannel(RAFContainer4.java:698)
      at org.apache.derby.impl.store.raw.data.RAFContainer4.writePage(RAFContainer4.java:593)
      at org.apache.derby.impl.store.raw.data.CachedPage.writePage(CachedPage.java:787)
      at org.apache.derby.impl.store.raw.data.CachedPage.clean(CachedPage.java:610)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(ConcurrentCache.java:551)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(ConcurrentCache.java:509)
      at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(ConcurrentCache.java:460)
      at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(BaseDataFileFactory.java:1211)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1710)
      at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1507)
      at org.apache.derby.impl.store.raw.RawStore.stop(RawStore.java:368)
      at org.apache.derby.impl.services.monitor.TopService.stop(TopService.java:442)
      at org.apache.derby.impl.services.monitor.TopService.shutdown(TopService.java:393)
      at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:229)
      at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:199)
      at org.apache.derby.jdbc.InternalDriver.connect(InternalDriver.java:231)
      at org.apache.derby.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:119)
      at java.sql.DriverManager.getConnection(DriverManager.java:582)
      at java.sql.DriverManager.getConnection(DriverManager.java:154)
      at org.apache.derbyTesting.junit.DriverManagerConnector.getConnectionByAttributes(DriverManagerConnector.java:137)
      at org.apache.derbyTesting.junit.DriverManagerConnector.shutEngine(DriverManagerConnector.java:120)
      at org.apache.derbyTesting.junit.TestConfiguration.shutdownEngine(TestConfiguration.java:1599)
      at org.apache.derbyTesting.junit.SystemPropertyTestSetup.setUp(SystemPropertyTestSetup.java:83)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
      at junit.framework.TestResult.runProtected(TestResult.java:124)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.framework.TestResult.runProtected(TestResult.java:124)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
      at junit.framework.TestSuite.runTest(TestSuite.java:230)
      at junit.framework.TestSuite.run(TestSuite.java:225)
      at junit.framework.TestSuite.runTest(TestSuite.java:230)
      at junit.framework.TestSuite.run(TestSuite.java:225)
      at junit.framework.TestSuite.runTest(TestSuite.java:230)
      at junit.framework.TestSuite.run(TestSuite.java:225)
      at junit.textui.TestRunner.doRun(TestRunner.java:121)
      at junit.textui.TestRunner.start(TestRunner.java:185)
      at junit.textui.TestRunner.main(TestRunner.java:143)

      Attachments

      1. DERBY-4920_db2compatibility.txt
        3 kB
        Myrna van Lunteren
      2. derby-4920-a.diff
        4 kB
        Dag H. Wanvik
      3. derby-4920-a.stat
        0.1 kB
        Dag H. Wanvik
      4. jstack.19071.txt
        11 kB
        Knut Anders Hatlen
      5. RAFContainer4-instrumentation.diff
        11 kB
        Dag H. Wanvik

        Activity

        Dag H. Wanvik added a comment - edited

        Committed as svn 1049150, resolving

        Dag H. Wanvik added a comment -

        Regressions passed. I also ran LobSortTest with the configuration that hung (Solaris, on /tmp) in a loop for a while without seeing the hang after I applied the patch.

        I will commit this tomorrow.

        Dag H. Wanvik added a comment -

        I was finally able to track this one down by instrumenting RAFContainer4 to collect a bit of history, cf. the uploaded patch "RAFContainer4-instrumentation". (Thanks to Knut for noticing that it happens more often when running on an in-memory file system, e.g. /tmp on Solaris.) It turned out the timing was slightly different there: instead of happening very infrequently, it happened more often than not when running the test that we saw hang, LobSortTest.

        (Aside: trying to reproduce on Solaris made it even harder since the interrupt would sometimes get swallowed by non-NIO IO calls: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6871306)

        In any case, this is what happens: the test shuts down the database to modify static properties ("SystemPropertyTestSetup") after having created the table MIXED_LOBS in CleanDatabaseTestSetup#decorateSQL. The raw store daemon thread is performing a checkpoint at the time the user thread shuts down Derby. The shutdown interrupts the daemon while it is doing NIO. The daemon tries to recover the channel. This fails due to a ShutdownException thrown when InterruptStatus.setInterrupted tries to find the lcc by calling ContextService#getContextOrNull. (RAFContainer4#recoverContainerAfterInterrupt calls noteAndClearInterrupt, which in turn calls setInterrupted.)

        Later during shutdown, the user thread detects that there are still dirty pages and tries (again) to perform the checkpoint, but now the channel is closed because the daemon's recovery failed. This leads to the hang: the user thread keeps waiting for a recovery that will never happen.

        The fix is simple: just ignore the ShutdownException from ContextService#getContextOrNull (as we did for DERBY-4911).
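
        A minimal sketch of the shape of that change (illustrative only, not the committed patch; the class and helper names are made up here, and the signatures are recalled from the 10.8 tree, so treat them as approximate):

            // Tolerate ShutdownException while looking up the lcc instead of
            // letting it abort channel recovery (which is what caused this hang).
            import org.apache.derby.iapi.error.ShutdownException;
            import org.apache.derby.iapi.services.context.ContextService;
            import org.apache.derby.iapi.sql.conn.LanguageConnectionContext;

            class InterruptStatusSketch {
                static LanguageConnectionContext lookupLccOrNull() {
                    try {
                        return (LanguageConnectionContext)
                            ContextService.getContextOrNull(
                                LanguageConnectionContext.CONTEXT_ID);
                    } catch (ShutdownException se) {
                        // The engine is shutting down, so there is no lcc to
                        // note the interrupt on; returning null lets the caller
                        // fall back to a thread-level flag rather than failing
                        // the recovery attempt.
                        return null;
                    }
                }
            }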

        Note that this confirms my hypothesis that checkpointing at shutdown has been faulty since we introduced NIO containers in 10.2: the interrupted state of the channel was masked by the shutdown. It would only ever be an issue if there was a background thread doing checkpointing at shutdown time, however. The next reboot would read the logs and recover, of course, so that's probably why we have never had a report on this.

        Patch details:

        • InterruptStatus.java

        Make sure we always catch ShutdownException when we access the context looking for lcc.

        • RAFContainer4.java

        Improved some comments, added a final modifier, and added a maximum retry count to a loop that lacked one: it now throws FILE_IO_INTERRUPTED if it exceeds MAX_INTERRUPT_RETRIES (as we do elsewhere when we give up). A rough sketch of this idiom follows below.

        Running regressions again.
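
        For illustration, the bounded-retry idiom described above looks roughly like this (a sketch, not the committed patch: the class and method names and the constant's value are placeholders; only FILE_IO_INTERRUPTED and MAX_INTERRUPT_RETRIES are taken from the description above):

            import org.apache.derby.iapi.error.StandardException;
            import org.apache.derby.iapi.reference.SQLState;

            abstract class RestoreWaitSketch {
                // Value illustrative; the real constant lives in RAFContainer4.
                private static final int MAX_INTERRUPT_RETRIES = 120;

                // Placeholder for "has another thread restored the channel yet?"
                abstract boolean channelHasBeenRestored();

                void waitForChannelRestore() throws StandardException {
                    int retries = 0;
                    while (!channelHasBeenRestored()) {
                        if (++retries > MAX_INTERRUPT_RETRIES) {
                            // Give up with FILE_IO_INTERRUPTED, as elsewhere,
                            // instead of waiting forever for a recovery that
                            // will never happen.
                            throw StandardException.newException(
                                    SQLState.FILE_IO_INTERRUPTED);
                        }
                        try {
                            Thread.sleep(500); // back off before re-checking
                        } catch (InterruptedException ie) {
                            // The real code records the interrupt via
                            // InterruptStatus; ignored here for brevity.
                        }
                    }
                }
            }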

        Dag H. Wanvik added a comment -

        The test which I see hanging is LobSortTest. It uses a pseudo-random seed and also claims to be sensitive to heap size, which could explain why we see the hang only now and then. Possibly I can make it reproducible by reusing the seed seen when it hangs; we shall see.

        Dag H. Wanvik added a comment -

        Thanks for that observation; I'll try that to improve turnaround while tracking this down.

        Kristian Waagan added a comment -

        Same with me. I ran with the parallel runner (slightly modified) that Knut posted a while back. The stack trace was the same as the one posted above.

        Knut Anders Hatlen added a comment -

        I just got the hang again when running the jdbc4 suite separately, so it seems we don't need to run the full suites.All to reproduce it, at least.

        Dag H. Wanvik added a comment -

        Preliminary results from a history-collection patch seem to indicate that the container was closed due to an in-line compress prior to the shutdown's checkpointing. Weird. Digging further...

        Dag H. Wanvik added a comment -

        Over the weekend I was able to reproduce this on two different machines; thanks for your help so far. As in Knut's observation, it happens in the setup for jdbc4/LobSortTest. On one of the machines there was a lingering NetworkServerControl still running.

        Myrna van Lunteren added a comment -

        I saw this yesterday in our nightly test of trunk on Linux, in the derbyall test lang/db2Compatibility.sql.
        This was with IBM 1.5 (SR11 FP1).

        I'm attaching the stack of the hanging process's main thread...

        Lily Wei added a comment -

        Why would RAFContainer4.awaitRestoreChannel be waiting when jdbc4/LobSortTest is just setting up? Was there some transaction not being finished somewhere? Hmm... It is really odd. I tried to reproduce it by having another connection open while running jdbc4/LobSortTest. No luck though. ...

        Dag H. Wanvik added a comment -

        Thanks, Knut! I'll start working on a diagnostic patch to collect some info so we can know more when it happens.

        Knut Anders Hatlen added a comment -

        The Tinderbox got stuck again when testing revision 1041375. It looks like it happens in the setup for jdbc4/LobSortTest.

        Attaching thread dump (jstack.19071.txt). It happened on a Solaris 10 machine with Java version 1.6.0_18 (32-bit). suites.All ran with these flags: -Xmx512M -XX:MaxPermSize=256M -DderbyTesting.oldReleasePath=/usr/local/share/java/derby/lib -Dderby.tests.trace=true -Dderby.tests.ThreadsMinutes=33x3

        Dag H. Wanvik added a comment - edited

        Interestingly, this happens after an interrupt sent by Derby to itself during shutdown, cf. BaseMonitor#shutdown, ca. line 176, where we see:

        // Shutdown all threads by iterrupting them
        contextService.notifyAllActiveThreads((Context) null);

        Further down, we stop the TopService via TopService.shutdown, which appears in the stack trace above.

        So this hang occurs during checkpointing, when we are trying to recover from an interrupted I/O. Obviously, there is a bug in the new recovery code (also).

        Furthermore, I think this means that in the existing implementation, checkpointing during shutdown would sometimes get erroneously interrupted (ever since we introduced NIO in RAFContainer4). I am not aware of any reports on this, presumably because Derby will recover from the logs anyway at the next reboot...

        Knut Anders Hatlen added a comment -

        Yes, that's exactly how I obtained the stack trace above. I used jps to get the pid, and jstack to dump the stack for all threads.

        Kristian Waagan added a comment -

        Lily, one way to do it is to use the command 'jstack', which comes with the JDK. I believe you can also press CTRL-BREAK (in the window running the Java process).
        You can also use other tools, like VisualVM, or send the signal QUIT to the process.

        I usually use jstack. To find the pid of the Java process, use 'jps' (also comes with the JDK).
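
        (As an in-process alternative when an external tool is not handy, one can also print all stack traces from a watchdog thread inside the stuck JVM via Thread.getAllStackTraces; a minimal sketch, with the class and method names DumpStacks/dumpAllStacks made up here:)

            import java.util.Map;

            public final class DumpStacks {
                // Print every live thread's name, state and stack to stderr,
                // roughly mimicking the output of jstack.
                public static void dumpAllStacks() {
                    for (Map.Entry<Thread, StackTraceElement[]> e
                            : Thread.getAllStackTraces().entrySet()) {
                        System.err.println("\"" + e.getKey().getName()
                                + "\" state=" + e.getKey().getState());
                        for (StackTraceElement frame : e.getValue()) {
                            System.err.println("\tat " + frame);
                        }
                        System.err.println();
                    }
                }
            }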

        Hope this helps,

        Lily Wei added a comment -

        Hi Knut:
        If the test is stuck, how do you print out the stack trace for the waiting thread? I think we can use the same tactic as a fix for DERBY-4856. Thanks.

        Dag H. Wanvik added a comment -

        Thanks, Knut. Good to get these issues flushed out. It appears to be an issue during shutdown; I'll look at it ASAP.


          People

          • Assignee:
            Dag H. Wanvik
          • Reporter:
            Knut Anders Hatlen
          • Votes:
            0
          • Watchers:
            0
