Accumulo / ACCUMULO-2198

Concurrent randomwalk fails with unbalanced servers

    Details

      Description

      Intermittently, I am seeing the Concurrent randomwalk test fail with:

      java.lang.Exception: Error running node Concurrent.xml
              at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:259)
      ...
      Caused by: java.lang.Exception: Error running node ct.CheckBalance
              at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:259)
              at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:251)
              ... 8 more
      Caused by: java.lang.Exception: servers are unbalanced!
              at org.apache.accumulo.server.test.randomwalk.concurrent.CheckBalance.visit(CheckBalance.java:74)
              at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:251)
              ... 9 more
      

      In one case, the 15-minute allowance for balancing reached back into a prior run of Concurrent.xml within the same overall test run. In another case, the time span began at a point when HDFS failed to contact a datanode.

          Activity

          Bill Havanki added a comment -

          Review available for at least a partial solution.

          Bill Havanki added a comment -

          The second review builds on the first. It adds a check backported from 1.5.x, so that the test cannot fail until at least three consecutive server balance checks have failed. This should improve the success rate of the Concurrent test under 1.4.x.
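
          As a rough illustration of the three-check rule described above, the sketch below counts consecutive failed balance checks and only signals a test failure once a threshold is reached. The class name ConsecutiveFailureGate and its method are hypothetical and are not taken from CheckBalance.java; it also assumes that a balanced check resets the count, which is what "in a row" suggests.

```java
// Hypothetical helper for illustration only; not the actual CheckBalance code.
final class ConsecutiveFailureGate {
    private final int threshold;
    private int consecutiveFailures = 0;

    ConsecutiveFailureGate(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records the outcome of one balance check.
     * Returns true only when the number of consecutive unbalanced
     * results has reached the threshold.
     */
    boolean record(boolean balanced) {
        if (balanced) {
            // Assumption: any balanced result breaks the streak.
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= threshold;
    }
}
```

          With a threshold of 3, two unbalanced results followed by a balanced one start the count over, so a single slow rebalance should not fail the test on its own.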

          ASF subversion and git services added a comment -

          Commit cd4eac0d7e2820321db9fc9cdfc8dc89f7dd53d2 in branch refs/heads/1.4.5-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cd4eac0 ]

          ACCUMULO-2198 Concurrent randomwalk: add teardown, fix server balance check

          The Concurrent randomwalk test had been using a test node property to remember the
          last time when servers were unbalanced, but this property was not getting cleaned up
          between runs. Therefore, if a new Concurrent test was started some time later, it
          would pick up the old timestamp property from the last run. This commit adds removal
          of the property during test teardown, and also moves the tracking from a node
          property to test state.

          In addition, the test logic would reset the timestamp every time servers were found
          unbalanced, provided the 15-minute allowance hadn't expired. This commit fixes that
          issue as well. This could lead to more, correct, reports of unbalanced servers.

          Lastly, the test in 1.5.x requires three checks for unbalanced servers to fail before
          failing the test. This commit backports that requirement to 1.4.x.

          The timestamp reset and three-check fixes were added to 1.5.x in commit 0ee7e5a8.
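
          The timestamp handling described in this commit message might look roughly like the sketch below. It is not the committed code: the class UnbalancedTracker, the state key, and the use of a plain Map to stand in for randomwalk test state are all assumptions made for illustration. It also assumes that a balanced result clears the timestamp. The sketch records the time servers were first seen unbalanced, leaves that time alone on later unbalanced checks, and removes it on teardown so that a later run starts clean.

```java
import java.util.Map;

// Hypothetical sketch; names and the Map-based state are assumptions.
final class UnbalancedTracker {
    // Hypothetical key under which the timestamp is kept in test state.
    static final String FIRST_UNBALANCED_KEY = "firstUnbalancedTime";
    // The 15-minute allowance mentioned in the issue.
    static final long ALLOWANCE_MS = 15L * 60L * 1000L;

    /** Returns true when servers have stayed unbalanced past the allowance. */
    static boolean check(Map<String, Object> state, boolean balanced, long nowMs) {
        if (balanced) {
            // Assumption: a balanced result clears the tracked time.
            state.remove(FIRST_UNBALANCED_KEY);
            return false;
        }
        Long first = (Long) state.get(FIRST_UNBALANCED_KEY);
        if (first == null) {
            // First unbalanced observation: start the clock.
            state.put(FIRST_UNBALANCED_KEY, nowMs);
            return false;
        }
        // Do not reset the timestamp here; resetting on every unbalanced
        // check is the behavior the commit message says was fixed.
        return nowMs - first > ALLOWANCE_MS;
    }

    /** Teardown: drop the timestamp so a later run cannot pick it up. */
    static void tearDown(Map<String, Object> state) {
        state.remove(FIRST_UNBALANCED_KEY);
    }
}
```

          Because the start time is kept, an imbalance that lasts longer than the allowance is reported, and because teardown removes it, a run started later does not inherit a stale time.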

          ASF subversion and git services added a comment -

          Commit cd4eac0d7e2820321db9fc9cdfc8dc89f7dd53d2 in branch refs/heads/1.5.1-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cd4eac0 ]

          ASF subversion and git services added a comment -

          Commit cd4eac0d7e2820321db9fc9cdfc8dc89f7dd53d2 in branch refs/heads/1.6.0-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cd4eac0 ]

          ASF subversion and git services added a comment -

          Commit cd4eac0d7e2820321db9fc9cdfc8dc89f7dd53d2 in branch refs/heads/master from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cd4eac0 ]

            People

            • Assignee: Bill Havanki
            • Reporter: Bill Havanki