Derby
  1. Derby
  2. DERBY-6001

ErrorMessageTest assert failure: Only one of the waiters should be aborted

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 10.10.1.1
    • Fix Version/s: 10.10.1.1
    • Component/s: Test
    • Labels:
      None
    • Environment:
      Java SE Embedded 7u6 for ARM

      Description

      I occasionally see this test failure on some ARM devices:

      junit.framework.AssertionFailedError: Only one of the waiters should be aborted
      at org.apache.derbyTesting.functionTests.tests.lang.ErrorMessageTest.testDeadlockTimeout(ErrorMessageTest.java:206)
      at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:117)
      at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBareOverridable(BaseJDBCTestCase.java:424)
      at org.apache.derbyTesting.junit.BaseJDBCTestCase.runBare(BaseJDBCTestCase.java:441)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
      at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
      at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
      at junit.extensions.TestSetup.run(TestSetup.java:25)
      at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)

      It's probably timing-dependent, since the failing test case runs two threads, and the devices where it's seen are slow compared to most other test servers.

      1. d6001-2a-waitTimeout.diff
        3 kB
        Knut Anders Hatlen
      2. d6001-1a-diagnostics.diff
        2 kB
        Knut Anders Hatlen

        Issue Links

          Activity

          Hide
          Knut Anders Hatlen added a comment -

          Haven't seen this again after the fix went in. Resolving.

          Show
          Knut Anders Hatlen added a comment - Haven't seen this again after the fix went in. Resolving.
          Hide
          Knut Anders Hatlen added a comment -

          The test has been running in a loop for a couple of hours without failing.

          Committed revision 1412577.

          I'm keeping the issue open until we've seen that the test runs more reliably in a few full test runs.

          Show
          Knut Anders Hatlen added a comment - The test has been running in a loop for a couple of hours without failing. Committed revision 1412577. I'm keeping the issue open until we've seen that the test runs more reliably in a few full test runs.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching d6001-2a-waitTimeout.diff which increases the wait timeout for testDeadlockTimeout to 60 seconds. The test still completes in just a few seconds, as the higher timeout just gives it a larger window, it doesn't require the test to use the entire available time slot.

          Since the test now uses different timeout values for testLockTimeout and testDeadlockTimeout, I also reduced the wait timeout for the former test case to speed it up. That test case is single-threaded and not timing sensitive, so reducing the timeout should not reduce the chance of success.

          I also reduced the deadlock timeout for testDeadlockTimeout to 1 second, as it is the difference between the wait timeout and the deadlock timeout that's determining the chance of success, not the size of the deadlock timeout. So we could just as well keep it short to make the test complete faster (even if it's just one second faster).

          I'm running the test in a loop on one of the machines where I've seen the problem (successfully so far). The problem happens a lot less frequently when the test runs standalone, though, so I'm sure that necessarily will help finding out if it's fixed.

          (suites.All takes 15 hours on that platform, so I'm not going to run suites.All in a loop to verify the fix...)

          Show
          Knut Anders Hatlen added a comment - Attaching d6001-2a-waitTimeout.diff which increases the wait timeout for testDeadlockTimeout to 60 seconds. The test still completes in just a few seconds, as the higher timeout just gives it a larger window, it doesn't require the test to use the entire available time slot. Since the test now uses different timeout values for testLockTimeout and testDeadlockTimeout, I also reduced the wait timeout for the former test case to speed it up. That test case is single-threaded and not timing sensitive, so reducing the timeout should not reduce the chance of success. I also reduced the deadlock timeout for testDeadlockTimeout to 1 second, as it is the difference between the wait timeout and the deadlock timeout that's determining the chance of success, not the size of the deadlock timeout. So we could just as well keep it short to make the test complete faster (even if it's just one second faster). I'm running the test in a loop on one of the machines where I've seen the problem (successfully so far). The problem happens a lot less frequently when the test runs standalone, though, so I'm sure that necessarily will help finding out if it's fixed. (suites.All takes 15 hours on that platform, so I'm not going to run suites.All in a loop to verify the fix...)
          Hide
          Knut Anders Hatlen added a comment -

          This happened again. With the improved diagnostics, I found in the log that one of the threads failed with the expected deadlock exception, and the other thread failed with a timeout exception. So it seems like the difference between the deadlock timeout and the regular wait timeout is too small on this platform. The deadlock is detected, as expected, but the chosen victim is not terminated quickly enough to prevent the other transaction from timing out.

          Since this test case is not expected to produce a timeout, I think we can increase the lock timeout without slowing it down in the common case. That should give slower machines more time to kill the victim before the timeout expires. Faster machines don't need to wait for the timeout to happen, so they should be unaffected.

          Show
          Knut Anders Hatlen added a comment - This happened again. With the improved diagnostics, I found in the log that one of the threads failed with the expected deadlock exception, and the other thread failed with a timeout exception. So it seems like the difference between the deadlock timeout and the regular wait timeout is too small on this platform. The deadlock is detected, as expected, but the chosen victim is not terminated quickly enough to prevent the other transaction from timing out. Since this test case is not expected to produce a timeout, I think we can increase the lock timeout without slowing it down in the common case. That should give slower machines more time to kill the victim before the timeout expires. Faster machines don't need to wait for the timeout to happen, so they should be unaffected.
          Hide
          Knut Anders Hatlen added a comment -

          Attaching d6001-1a-diagnostics.diff which makes the test print more information when this failure happens. It should now print the stack traces of both failing threads the next time it happens.

          Committed revision 1412042.

          Show
          Knut Anders Hatlen added a comment - Attaching d6001-1a-diagnostics.diff which makes the test print more information when this failure happens. It should now print the stack traces of both failing threads the next time it happens. Committed revision 1412042.

            People

            • Assignee:
              Knut Anders Hatlen
              Reporter:
              Knut Anders Hatlen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development