DERBY-5643

Occasional hangs in replication tests on Linux

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.9.1.0
    • Fix Version/s: 10.9.1.0
    • Component/s: Replication, Test
    • Labels:
      None
    • Bug behavior facts:
      Regression Test Failure

      Description

      We occasionally see hangs in the replication tests on Linux. For example here: http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.6/testing/testlog/sles/1298470-suitesAll_diff.txt

      This test run was stuck in tearDown() after ReplicationRun_Local_Derby4910.testSlaveWaitsForMaster(). (Waiting for Thread.join() to return.)

      Attachments

      1. waitFor-2.diff
        2 kB
        Knut Anders Hatlen
      2. waitFor.diff
        1.0 kB
        Knut Anders Hatlen
      3. thread-dump.txt
        12 kB
        Knut Anders Hatlen
      4. higher-timeout.diff
        13 kB
        Knut Anders Hatlen
      5. fail-on-timeout.diff
        4 kB
        Knut Anders Hatlen

        Activity

        Knut Anders Hatlen added a comment -

        The thread dump in the above link was truncated. Attaching a file with the full thread dump.

        Unfortunately, I only have the thread dump for the main JUnit process and not for the two sub-processes (master server and slave server). I'll try to get a thread dump for those processes too when it happens the next time.

        Kristian Waagan added a comment -

        I see that the handling of sub-process streams isn't optimal in ReplicationRun. This may lead to problems like those seen in DERBY-5608. It would be good to fix this once further investigation has been carried out.

        Knut Anders Hatlen added a comment -

        I ran ReplicationSuite in a loop on one of the machines where this was seen, and after a couple of iterations it was stuck. I ran the suite with derby.tests.repltrace=true, and the trace indicated that the slave server used more than 3 minutes to get up and accept connections. ReplicationRun.startServer() pings the server for 75 seconds before it gives up, so it gave up before the server was up. The attempt to shut down the slave server also failed, because a connection to the server could not be established. tearDown() ended up waiting for the server to stop, and the server of course didn't stop since it never received the shutdown command.

        So the question is: Why does it take 3 minutes for the server to start accepting connections?
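
        In outline, the ping-until-timeout pattern looks something like this (a minimal stand-alone sketch with hypothetical names, not the actual code; the real logic lives in ReplicationRun.startServer()):

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            public class PingUntilUp {
                // Poll the server port until a connection succeeds or the
                // timeout elapses (startServer() waits roughly 75 seconds).
                static boolean pingServer(String host, int port, long timeoutMillis)
                        throws InterruptedException {
                    long deadline = System.currentTimeMillis() + timeoutMillis;
                    while (System.currentTimeMillis() < deadline) {
                        try (Socket s = new Socket()) {
                            s.connect(new InetSocketAddress(host, port), 1000);
                            return true;  // server is up and accepting connections
                        } catch (IOException e) {
                            Thread.sleep(500);  // not up yet; retry shortly
                        }
                    }
                    return false;  // gave up before the server came up
                }
            }

        If the server needs more than three minutes to open its listening socket, a loop like this gives up long before the server is ready, which matches the hang described above.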

        Kristian Waagan added a comment -

        Just another data point: I ran the replication suite 100 times on Linux (Fedora) without seeing the hang. It consistently took about 3 minutes per iteration. I ran with 1.6.0_22 (IcedTea6 1.10.4), sane Derby build.

        Knut Anders Hatlen added a comment -

        The call that takes three minutes is ServerSocketFactory.createServerSocket().

        I notice that the replication test cases wait until they see EOF on stdout and stderr from the forked processes. However, they do not actually check that the forked processes have terminated before they continue. Maybe this causes a problem if the old server process hasn't completely released all resources when the new server process starts.

        The attached patch adds a call to Process.waitFor() after the streams have been exhausted. It might be unrelated, but I'm running ReplicationSuite in a loop now to see if it helps.
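
        The idea of the patch, as a stand-alone sketch (hypothetical code illustrating the pattern, not the patch itself): drain the sub-process output to EOF, then block on Process.waitFor() so the test only proceeds once the process has actually terminated:

            import java.io.IOException;
            import java.io.InputStream;

            public class DrainAndWait {
                static int run(String... cmd) throws IOException, InterruptedException {
                    ProcessBuilder pb = new ProcessBuilder(cmd);
                    // Merge stderr into stdout so a single reader suffices and a
                    // full stderr pipe buffer cannot block the child process.
                    pb.redirectErrorStream(true);
                    Process p = pb.start();
                    InputStream out = p.getInputStream();
                    byte[] buf = new byte[4096];
                    while (out.read(buf) != -1) {
                        // Discard the output. EOF here only means the stream was
                        // closed, not that the process has exited...
                    }
                    return p.waitFor();  // ...so also wait for real termination.
                }
            }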

        Knut Anders Hatlen added a comment -

        The previous patch didn't help. The test still got stuck rather quickly. But I found another place where we didn't wait for the process to complete. Attaching an updated patch to account for that. With the latest patch, I've had ReplicationSuite running in a loop for 1 1/2 hours without seeing a hang. I usually see the hang in less than half an hour, so it looks promising.

        Knut Anders Hatlen added a comment -

        The test eventually hung with the waitFor-2.diff patch too. It took longer than it did before, so maybe it helped, but that may also be coincidental. In any case, the changes should improve resource handling in the replication tests, so I committed the patch with revision 1299573.

        Knut Anders Hatlen added a comment -

        In the latest test cycle, the compatibility tests hung on one of the platforms where we've seen hangs in the replication tests, so the problem doesn't seem to be specific to the replication tests.

        I found this bug report, which seems to match what we're seeing with createServerSocket() taking more than three minutes: http://bugs.sun.com/view_bug.do?bug_id=6483406

        The bug is fixed in Java 7, which is consistent with what we're seeing in the nightlies (I have only found occurrences with Java 5 and Java 6).

        Perhaps we could work around this problem in the tests by increasing the timeout for server startup to more than 3 minutes. Since we stop waiting as soon as the server is up, this shouldn't slow down test execution unless there actually is a problem with starting the server.

        Kristian Waagan added a comment -

        +1 on the extended wait.
        You could consider adding an alarm(...) if waiting for longer than the current limit, but I don't know if it's worth it since I don't know how many people are running the tests to qualify Derby on their machine/platform.

        Knut Anders Hatlen added a comment -

        Attaching a new patch (higher-timeout.diff) that changes the default server startup timeout in NetworkServerTestSetup to four minutes. If the server startup eventually succeeds, but takes more than one minute, it'll call alarm(...) to signal that something isn't quite right, but it won't actually fail unless it has been unsuccessful for four minutes.

        The patch also makes the replication tests and the compatibility tests use NetworkServerTestSetup's helper methods to ping the server while waiting for it to start. This makes it possible to remove some duplicate logic from those tests, and makes the tests behave in a more consistent manner.

        I've run derbyall, suites.All and the compatibility tests successfully with the patch. I've also had the replication tests running in a loop for four hours on a machine where I saw frequent hangs before. No hangs so far, but I've seen two occurrences of the alarm message, which indicates that the condition that previously made the tests hang did happen:

        ALARM: Very slow server startup: 189735 ms
        ALARM: Very slow server startup: 189850 ms
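
        The timing logic, in a minimal sketch (hypothetical names; the actual implementation is in NetworkServerTestSetup): fail only after four minutes, but print the alarm once a successful startup has taken more than one minute:

            public class StartupWatchdog {
                static final long FAIL_TIMEOUT_MS = 4 * 60 * 1000;  // hard limit
                static final long ALARM_THRESHOLD_MS = 60 * 1000;   // "very slow" warning

                // Stand-in for the real ping helper in NetworkServerTestSetup.
                static boolean ping() { return false; /* hypothetical stub */ }

                static void waitForServer() throws InterruptedException {
                    long start = System.currentTimeMillis();
                    while (!ping()) {
                        if (System.currentTimeMillis() - start > FAIL_TIMEOUT_MS) {
                            throw new AssertionError("Server did not come up in time");
                        }
                        Thread.sleep(500);  // brief pause between ping attempts
                    }
                    long elapsed = System.currentTimeMillis() - start;
                    if (elapsed > ALARM_THRESHOLD_MS) {
                        System.err.println("ALARM: Very slow server startup: "
                                + elapsed + " ms");
                    }
                }
            }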

        Knut Anders Hatlen added a comment -

        I had the replication tests running for 20 hours without seeing hangs. The warning about very slow server startup was printed nine times.

        Committed revision 1300475.

        Knut Anders Hatlen added a comment -

        The compatibility tests have had problems in the nightlies after this fix went in. I suspect that they are caused by the changes made in this issue, although I did run the compatibility tests successfully in my environment before committing. I'll investigate.

        Knut Anders Hatlen added a comment -

        I've found some problems with the previous changes. I'm not sure if that's what's causing the problems in the nightly testing, though.

        The replication tests and the compatibility tests now call pingForServerUp() and wait until the server is up or a timeout happens, but they don't fail if a timeout happens. The attached patch (fail-on-timeout.diff) makes those tests check the return value of pingForServerUp() and fail if the server did not come up in time.

        The patch also changes how AutoloadTest.testAutoNetworkServerBoot() verifies that the server did not come up. In the existing code, it pings until the timeout happens, and then returns successfully. This took 40 seconds before the timeout was changed, and now it takes 4 minutes. The timeout is not supposed to affect tests under normal operation, so it should not have pinged that long in the first place.

        The patch makes the test wait for a shorter time (5 seconds) before concluding that the server didn't come up. This may cause bugs to go unnoticed on very slow machines (if the server comes up when it shouldn't, but it takes more than 5 seconds), but it will speed up the test considerably and still detect the problem on reasonably fast machines.
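
        In sketch form (hypothetical helpers; the real method is NetworkServerTestSetup's pingForServerUp, whose exact signature may differ), the two checks the patch introduces look like this:

            public class ServerChecks {
                // Stand-in for the real ping helper: true once the server
                // answers within timeoutMs.
                static boolean pingForServerUp(long timeoutMs) { return false; }

                static void assertServerUp(long timeoutMs) {
                    // fail-on-timeout.diff: a startup timeout is now a test
                    // failure instead of a silent continuation.
                    if (!pingForServerUp(timeoutMs)) {
                        throw new AssertionError(
                                "Server did not come up within " + timeoutMs + " ms");
                    }
                }

                static void assertServerNotUp() {
                    // The negative check pings for only 5 seconds instead of the
                    // full (now 4-minute) startup timeout.
                    if (pingForServerUp(5000)) {
                        throw new AssertionError("Server came up unexpectedly");
                    }
                }
            }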

        Knut Anders Hatlen added a comment -

        Committed revision 1302811.

        Mike Matrigali added a comment -

        Got 7 failures in the nightly tests on ibm16 and ibm17. Are these the kinds of problems you expect to be fixed by the changes in this JIRA? If so, I won't bother to open a new JIRA:
        http://people.apache.org/~myrnavl/derby_test_results/main/windows/testlog/ibm16/1301855-derbyall_diff.txt
        http://people.apache.org/~myrnavl/derby_test_results/main/windows/testlog/ibm17/1301855-derbyall_diff.txt

        Knut Anders Hatlen added a comment -

        The fix only touched helper methods used by the JUnit tests, so I don't think tests in derbyall will be affected by the fix. A new JIRA would probably be in order.

        Knut Anders Hatlen added a comment -

        I mentioned in an earlier comment that the compatibility tests in Oracle's nightly test runs had problems after the fix went in. That wasn't quite true. When looking closer at the test reports, it turned out it was suites.All that had problems. It looks like it got stuck in the replication tests on all platforms except Windows. And the tinderbox has hung in a similar state since Thursday (it did, however, run successfully twice after the fix for this issue was checked in, but I have no doubt it was the changes in this issue that caused the problems). I haven't been able to reproduce these hangs in my own environment yet.

        The nightly tests haven't run with the latest fix for this issue yet, because the Svn EU mirror was down when the tests were supposed to start last night. Looks like they're running now, and so far two Solaris platforms have completed their suites.All runs - http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.6/testing/Limited/testSummary-1302996.html. Not sure if that means the problem is fixed, as it appears to be intermittent. I guess we know more in a couple of hours when the results from the other platforms appear.

        Knut Anders Hatlen added a comment -

        There haven't been any failures in the last two nightly test cycles, so it looks like the last fix sorted out the problems. Resolving again.


          People

          • Assignee: Knut Anders Hatlen
          • Reporter: Knut Anders Hatlen
          • Votes: 0
          • Watchers: 0
