Accumulo / ACCUMULO-2488

Concurrent randomwalk balance check needs refinement

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.4
    • Fix Version/s: 1.4.5, 1.5.2, 1.6.0
    • Component/s: test
    • Labels:

      Description

      The check for balanced tablets in the randomwalk Concurrent test fails too easily.

      Here is a real-life example from the test, showing the number of tablets on each of five tablet servers: 2, 5, 2, 2, 3. (An old unrelated table plays into these totals.) This produces a mean of 2.8. The test considers the cluster unbalanced when any server's count differs from the mean by more than the larger of 1 or the mean divided by 5. In this case, 2.8/5 is less than 1, so the threshold is 1, and the second tablet server fails the check since it has more than 3.8 tablets. Even a 4 would fail.
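
The arithmetic above can be reproduced with a short sketch. This is an illustration of the rule as described in this ticket, not the test's actual code; the function name `unbalanced_by_old_rule` and its parameters are hypothetical.

```python
def unbalanced_by_old_rule(counts, fraction=5.0, floor=1.0):
    """Return (server index, tablet count) pairs that the old rule would flag.

    Assumed rule, taken from the description: a server is unbalanced when its
    count differs from the mean by more than max(floor, mean / fraction).
    """
    mean = sum(counts) / len(counts)
    tolerance = max(floor, mean / fraction)
    return [(i, c) for i, c in enumerate(counts) if abs(c - mean) > tolerance]


# Tablet counts from the example: mean is 2.8, tolerance is max(1, 0.56) = 1.
print(unbalanced_by_old_rule([2, 5, 2, 2, 3]))  # [(1, 5)]

# A server with 4 tablets is flagged as well (mean 2.6, deviation 1.4).
print(unbalanced_by_old_rule([2, 4, 2, 2, 3]))  # [(1, 4)]
```

With so few tablets the tolerance collapses to the floor of 1, so almost any uneven distribution is reported as unbalanced.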

      Part of the problem in this particular case is that there are so few tablets, and so few tablet servers. The cluster also seems happy to leave these counts as is, as I continue to check it, so the test's definition of unbalanced is too narrow.

      The test needs to be refined to detect unbalanced conditions with a statistically sound calculation.

        Issue Links

          Activity

          Bill Havanki created issue -
          Bill Havanki made changes -
          Field Original Value New Value
          Link This issue relates to ACCUMULO-2198 [ ACCUMULO-2198 ]
          Mike Drob added a comment -

          The default balancer will allow a difference of up to the number of tables before it starts moving tablets. In this case it sounds like you had at least 4 (metadata, trace, old table, active table) and the largest difference was 3 (5-2), so the balancer did not do anything, as intended.
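
As a rough illustration of the tolerance described in this comment, the sketch below compares the spread between the busiest and idlest server against the number of tables. It is a simplification of the comment's wording, not the default balancer's real logic; `balancer_would_act` is a hypothetical name.

```python
def balancer_would_act(counts, num_tables):
    """Simplified model: the balancer moves tablets only when the largest
    difference in tablet counts exceeds the number of tables."""
    spread = max(counts) - min(counts)
    return spread > num_tables


# Counts from the ticket: the spread is 5 - 2 = 3.
print(balancer_would_act([2, 5, 2, 2, 3], num_tables=4))  # False
print(balancer_would_act([2, 5, 2, 2, 3], num_tables=2))  # True
```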

          Bill Havanki added a comment -

          Thanks, that's good information. There are actually five tables, as it turns out.

          Now, the question is, what should this balance check test for? Should it test that the default balancer is working as expected? Or should it test that the outcome fits some general definition of "balanced"? For example, that no server should be more than n standard deviations away from the mean.

          Bill Havanki made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Bill Havanki made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Fix Version/s 1.4.5 [ 12324754 ]
          Fix Version/s 1.5.2 [ 12326272 ]
          Fix Version/s 1.6.0 [ 12322468 ]
          Bill Havanki made changes -
          Remote Link This issue links to "Review (Web Link)" [ 14606 ]
          Bill Havanki added a comment -

          Review up. For this attempt I chose to change the definition of an unbalanced server to one whose tablet count differs from the average by more than twice the standard deviation.
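
A minimal sketch of the revised criterion follows, for illustration only. It assumes a population standard deviation (dividing by the number of servers); the ticket does not say which form the committed code uses, and the function name `unbalanced_by_stddev` is hypothetical.

```python
import math


def unbalanced_by_stddev(counts, num_devs=2.0):
    """Return (server index, tablet count) pairs whose count differs from the
    mean by more than num_devs standard deviations.

    Assumption: population standard deviation (divide by n, not n - 1).
    """
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    tolerance = num_devs * math.sqrt(variance)
    return [(i, c) for i, c in enumerate(counts) if abs(c - mean) > tolerance]


# The counts from the description no longer fail: the deviation of 2.2 for
# the second server is within two standard deviations (about 2.33).
print(unbalanced_by_stddev([2, 5, 2, 2, 3]))  # []

# A clear outlier on a larger cluster is still flagged.
print(unbalanced_by_stddev([2, 2, 2, 2, 2, 2, 2, 2, 2, 20]))  # [(9, 20)]
```

The `num_devs` parameter corresponds to the "n standard deviations" idea raised earlier in the thread, with 2 being the value chosen for the patch.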

          ASF subversion and git services added a comment -

          Commit a4174248a96cadcc79a9de4015c90c6618a96418 in accumulo's branch refs/heads/1.4.5-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=a417424 ]

          ACCUMULO-2488 Change criteria for unbalanced servers in concurrent randomwalk

          The Concurrent randomwalk test used to consider servers unbalanced if any server's
          tablet count differed from the cluster average by more than a fifth of the average or
          by one, whichever was larger. This would cause failures under typical balancings from
          the default balancer.

          This commit changes the criterion for an unbalanced server to be double the standard
          deviation from the cluster average.

          ASF subversion and git services added a comment -

          Commit a4174248a96cadcc79a9de4015c90c6618a96418 in accumulo's branch refs/heads/1.5.2-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=a417424 ]

          ASF subversion and git services added a comment -

          Commit a4174248a96cadcc79a9de4015c90c6618a96418 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=a417424 ]

          ASF subversion and git services added a comment -

          Commit a4174248a96cadcc79a9de4015c90c6618a96418 in accumulo's branch refs/heads/master from Bill Havanki
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=a417424 ]

          Bill Havanki added a comment -

          I've committed the adjustment as reviewed. ACCUMULO-2494 is about updating the standard deviation computation in o.a.a.core.util.Stat. Depending on how that turns out, the updated capability could be reused here, replacing the naive method I implemented. I'd be fine with either reopening this ticket to put in that change or having it done under ACCUMULO-2494.

          Bill Havanki made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Eric Newton made changes -
          Link This issue relates to ACCUMULO-2673 [ ACCUMULO-2673 ]
          Josh Elser made changes -
          Link This issue relates to ACCUMULO-3141 [ ACCUMULO-3141 ]

            People

            • Assignee: Bill Havanki
            • Reporter: Bill Havanki
            • Votes: 0
            • Watchers: 2
