Derby
  1. Derby
  2. DERBY-5192

Setting up network server for management tests hangs intermittently

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.8.1.2
    • Fix Version/s: 10.8.1.2, 10.9.1.0
    • Component/s: Network Server
    • Labels:
      None
    • Environment:
      Derby 10.8.1.0 - FreeBSD 8.2 (i386) - OpenJDK 6 (b20)
      Derby 10.8.1.0 - Oracle Enterprise Linux 6.0 (x86_64) - OpenJDK 6 (b17)
      Derby 10.8.1.1 - Debian GNU/Linux 6.0.1 (i386) - JDK 7 (build 1.7.0-ea-b135)

      Description

      I've seen on three occurrences with the 10.8.1.0 and 10.8.1.1 release candidates that suites.All has been stuck when setting up the network server decorator for the management test suite. Here's what I see at the end of the console output (running with derby.tests.trace=true):

      [junit] test_jdbc4_1 used 0 ms
      [junit] test_jdbc4_1 used 0 ms
      [junit] test_notBooted used 1623 ms java.net.SocketException: Connection reset
      [junit] at java.net.SocketInputStream.read(SocketInputStream.java:189)
      [junit] at java.net.SocketInputStream.read(SocketInputStream.java:121)
      [junit] at java.net.SocketInputStream.read(SocketInputStream.java:107)
      [junit] at org.apache.derby.impl.drda.NetworkServerControlImpl.fillReplyBuffer(Unknown Source)
      [junit] at org.apache.derby.impl.drda.NetworkServerControlImpl.readResult(Unknown Source)
      [junit] at org.apache.derby.impl.drda.NetworkServerControlImpl.pingWithNoOpen(Unknown Source)
      [junit] at org.apache.derby.impl.drda.NetworkServerControlImpl.ping(Unknown Source)
      [junit] at org.apache.derby.drda.NetworkServerControl.ping(Unknown Source)
      [junit] at org.apache.derbyTesting.junit.NetworkServerTestSetup.pingForServerUp(NetworkServerTestSetup.java:567)
      [junit] at org.apache.derbyTesting.junit.NetworkServerTestSetup.pingForServerStart(NetworkServerTestSetup.java:636)
      [junit] at org.apache.derbyTesting.junit.NetworkServerTestSetup.setUp(NetworkServerTestSetup.java:196)
      [junit] at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
      [junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
      [junit] at junit.extensions.TestSetup.run(TestSetup.java:25)
      [junit] at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
      [junit] at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      [junit] at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      [junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
      [junit] at junit.extensions.TestSetup.run(TestSetup.java:25)
      [junit] at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      [junit] at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      [junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
      [junit] at junit.extensions.TestSetup.run(TestSetup.java:25)
      [junit] at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
      [junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
      [junit] at junit.framework.TestSuite.run(TestSuite.java:225)
      [junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
      [junit] at junit.framework.TestSuite.run(TestSuite.java:225)
      [junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
      [junit] at junit.framework.TestSuite.run(TestSuite.java:225)
      [junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
      [junit] at junit.framework.TestSuite.run(TestSuite.java:225)
      [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:518)
      [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1052)
      [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:906)

      The exception trace is just printed to the console, but it doesn't make any test fail. (The exception printed on FreeBSD was different, it said "DRDA_InvalidReplyTooShort.S:Invalid reply from network server: Insufficient data." The other two looked like the one above.)

      All the hangs have happened on VirtualBox instances, though with different guest operating systems and JVMs. Probably the timing is different from what we have on physical machines.

      1. wait-notify.diff
        0.9 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          When looking at the thread dump for the hung process, I see the
          following:

          • There is a server thread that's blocked on a call to
            shutdownSync.wait() in NetworkServerControlImpl.blockingStart().
            This means the network server is up and running, waiting for a
            shutdown signal.
          • There is no ClientThread, whereas one would expect one if the
            network server was up. This is the thread that accepts incoming
            connections from the clients.
          • The main JUnit thread is stuck in a read call on the client socket.
            This is consistent with what one would expect when there is no
            ClientThread. The client is able to connect to the server, but the
            server won't process the connection.

          Inspecting a heap dump of the process, I notice that the shutdown flag
          in the NetworkServerControlImpl instance is actually true. So it looks
          like the server has been shut down, but it's still waiting for the
          shutdown signal. The only way I can see that this can happen, is if
          the shutdown signal is sent before blockingStart() has come to the
          shutdownSync.wait() call. The wait/notify code here isn't quite
          idiomatic. The wait() call isn't placed inside a while loop checking
          the wait condition, so if the shutdown happens too early, the server
          won't ever notice the notifyAll() call that tells it to stop waiting.

          We have many existing bug reports about problems when starting the
          network server from NetworkServerTestSetup (hangs, exceptions printed
          by ping(), and timeouts waiting for the network server to start). Some
          of these may have the same underlying cause as this bug. A quick JIRA
          search gave these hits: DERBY-4176, DERBY-4201, DERBY-4238,
          DERBY-4265, DERBY-4319, DERBY-4739, DERBY-4914.

          Show
          Knut Anders Hatlen added a comment - When looking at the thread dump for the hung process, I see the following: There is a server thread that's blocked on a call to shutdownSync.wait() in NetworkServerControlImpl.blockingStart(). This means the network server is up and running, waiting for a shutdown signal. There is no ClientThread, whereas one would expect one if the network server was up. This is the thread that accepts incoming connections from the clients. The main JUnit thread is stuck in a read call on the client socket. This is consistent with what one would expect when there is no ClientThread. The client is able to connect to the server, but the server won't process the connection. Inspecting a heap dump of the process, I notice that the shutdown flag in the NetworkServerControlImpl instance is actually true. So it looks like the server has been shut down, but it's still waiting for the shutdown signal. The only way I can see that this can happen, is if the shutdown signal is sent before blockingStart() has come to the shutdownSync.wait() call. The wait/notify code here isn't quite idiomatic. The wait() call isn't placed inside a while loop checking the wait condition, so if the shutdown happens too early, the server won't ever notice the notifyAll() call that tells it to stop waiting. We have many existing bug reports about problems when starting the network server from NetworkServerTestSetup (hangs, exceptions printed by ping(), and timeouts waiting for the network server to start). Some of these may have the same underlying cause as this bug. A quick JIRA search gave these hits: DERBY-4176 , DERBY-4201 , DERBY-4238 , DERBY-4265 , DERBY-4319 , DERBY-4739 , DERBY-4914 .
          Hide
          Knut Anders Hatlen added a comment -

          The attached patch changes the wait/notify code to use the more common pattern of doing the wait inside a loop that checks the wait condition. This should have the following advantages over the the current code:

          1) If the network server is shut down before blockingStart() gets to the wait() call, it won't hang indefinitely.

          2) If a spurious wakeup happen in the wait() call (which may happen, according to the Java SE javadocs) blockingStart() won't return prematurely.

          Show
          Knut Anders Hatlen added a comment - The attached patch changes the wait/notify code to use the more common pattern of doing the wait inside a loop that checks the wait condition. This should have the following advantages over the the current code: 1) If the network server is shut down before blockingStart() gets to the wait() call, it won't hang indefinitely. 2) If a spurious wakeup happen in the wait() call (which may happen, according to the Java SE javadocs) blockingStart() won't return prematurely.
          Hide
          Knut Anders Hatlen added a comment -

          I tested the patch manually the following way:

          1) In blockingStart(), I added a 20 second sleep right before the synchronized block around the wait() call.

          2) Started a network server using the CLI.

          3) While blockingStart() was in the 20 second sleep, shut down the network server using the CLI.

          4) Ping the network server using the CLI.

          What I observed was:

          I) Without the patch: The shutdown command in (3) would complete successfully. However, the network server process would continue running. The first ping in (4) would fail with a "connection reset" error. Subsequent pings would hang indefinitely.

          II) With the patch: The shutdown command in (3) completed successfully, and the network server process would terminate once the 20 second sleep was completed. When pinging the network server after the shutdown, a "connection reset" error was seen. Subsequent pings would hang until the 20 second sleep was completed and then fail with a "connection reset" error. Pinging the server after the server process had terminated failed with "connection refused".

          So, in short: Without the patch, the server isn't properly shut down, and incoming connection attempts will hang indefinitely. With the patch, the server is shut down, and incoming connection attempts will fail instead of hanging.

          Show
          Knut Anders Hatlen added a comment - I tested the patch manually the following way: 1) In blockingStart(), I added a 20 second sleep right before the synchronized block around the wait() call. 2) Started a network server using the CLI. 3) While blockingStart() was in the 20 second sleep, shut down the network server using the CLI. 4) Ping the network server using the CLI. What I observed was: I) Without the patch: The shutdown command in (3) would complete successfully. However, the network server process would continue running. The first ping in (4) would fail with a "connection reset" error. Subsequent pings would hang indefinitely. II) With the patch: The shutdown command in (3) completed successfully, and the network server process would terminate once the 20 second sleep was completed. When pinging the network server after the shutdown, a "connection reset" error was seen. Subsequent pings would hang until the 20 second sleep was completed and then fail with a "connection reset" error. Pinging the server after the server process had terminated failed with "connection refused". So, in short: Without the patch, the server isn't properly shut down, and incoming connection attempts will hang indefinitely. With the patch, the server is shut down, and incoming connection attempts will fail instead of hanging.
          Hide
          Dag H. Wanvik added a comment -

          I did look at this one too and agree with you analysis. Patch looks right. +1

          Show
          Dag H. Wanvik added a comment - I did look at this one too and agree with you analysis. Patch looks right. +1
          Hide
          Bryan Pendleton added a comment -

          I recall having seen the occasional hang such as this in the past. It never occurred to me to
          use the heap dump to check the status of global variables like that – neat idea!

          Your change seems good to me, too. If I understand your description, although this
          is a generic bug, it's most likely to affect only tests, not "normal" usage, because only
          tests are likely to try to shut the NS down so soon after starting it, is that right?

          Show
          Bryan Pendleton added a comment - I recall having seen the occasional hang such as this in the past. It never occurred to me to use the heap dump to check the status of global variables like that – neat idea! Your change seems good to me, too. If I understand your description, although this is a generic bug, it's most likely to affect only tests, not "normal" usage, because only tests are likely to try to shut the NS down so soon after starting it, is that right?
          Hide
          Knut Anders Hatlen added a comment -

          Thanks, Dag and Bryan, for looking at the patch.

          I think Bryan is right that this bug is not likely to affect servers in production, since they usually run for a longer time before they're shut down. The one thing that could also affect production servers, is the lack of protection against spurious wakeups from the wait() call. The javadoc for wait() says that wait() could return even when notify() and notifyAll() haven't been called, but blockingStart() doesn't account for that and treats every wakeup as a shutdown signal.

          Show
          Knut Anders Hatlen added a comment - Thanks, Dag and Bryan, for looking at the patch. I think Bryan is right that this bug is not likely to affect servers in production, since they usually run for a longer time before they're shut down. The one thing that could also affect production servers, is the lack of protection against spurious wakeups from the wait() call. The javadoc for wait() says that wait() could return even when notify() and notifyAll() haven't been called, but blockingStart() doesn't account for that and treats every wakeup as a shutdown signal.
          Hide
          Knut Anders Hatlen added a comment -

          Committed to trunk with revision 1094315.
          Committed to 10.8 with revision 1094329.

          Show
          Knut Anders Hatlen added a comment - Committed to trunk with revision 1094315. Committed to 10.8 with revision 1094329.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development