[HBASE-25697] StochasticBalancer improvement for large scale clusters - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Balancer, master, UI
Labels: None

Description

Findings on a large-scale cluster (100,000 regions on 300 nodes):

Balancer starts and stops before producing a plan
Adding new racks doesn’t trigger the balancer
Balancer stops while some racks are still left with 50% lower region counts
Regions of large tables don’t get evenly distributed
Observability is poor
Too many knobs make tuning empirical, requiring many experiments

Improvements made and being made

Cost function enhancement to capture outliers, especially table skew: https://issues.apache.org/jira/browse/HBASE-25625?filter=-2
Explain why the balancer stops: https://issues.apache.org/jira/browse/HBASE-25666. Will also backport https://issues.apache.org/jira/browse/HBASE-24528
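
To illustrate why a cost function can miss outliers, here is a minimal sketch, not the HBase implementation. It compares two hypothetical cost functions over per-server region counts: one based on the sum of absolute deviations from the mean, one based on the standard deviation. The function names, the scaling to [0, 1], and the sample distributions are all assumptions made for this example.

```python
import math


def abs_dev_cost(counts):
    """Hypothetical cost: sum of absolute deviations from the mean, scaled to [0, 1]."""
    n, total = len(counts), sum(counts)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    dev = sum(abs(c - mean) for c in counts)
    # Worst case used for scaling: every region sits on a single server.
    worst = (total - mean) + (n - 1) * mean
    return dev / worst


def std_dev_cost(counts):
    """Hypothetical cost: standard deviation, scaled to [0, 1] by the same worst case."""
    n, total = len(counts), sum(counts)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    var = sum((c - mean) ** 2 for c in counts) / n
    worst_var = ((total - mean) ** 2 + (n - 1) * mean ** 2) / n
    return math.sqrt(var / worst_var)


# Two made-up distributions of 100 regions over 10 servers (mean 10).
# Both have the same total absolute deviation, but the second has one outlier server.
spread = [9, 11, 9, 11, 9, 11, 9, 11, 9, 11]
outlier = [15, 9, 9, 9, 9, 9, 10, 10, 10, 10]

print(abs_dev_cost(spread), abs_dev_cost(outlier))
print(std_dev_cost(spread), std_dev_cost(outlier))
```

In this sketch the absolute-deviation cost cannot tell the two distributions apart, while the squared-deviation cost rates the one with the overloaded server as worse.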

More proposals

minCostNeedBalance for each cost function instead of weights. We want to trigger balancing if any factor is out of balance, instead of trying to combine the factors with arbitrary weights. This makes operation and configuration much easier.
Simulated annealing: periodically lower minCostNeedBalance to unstick the balancer from a sub-optimum, then gradually increase it again to keep the system stable. Also add the cost of moves as a countermeasure in the decision: https://opensourcelibs.com/lib/tempest
Orchestrated scheduling of compaction, normalizer and balancer
PID approach https://www.amazon.com/dp/1449361692/ref=rdr_ext_tmb
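
The first two proposals above can be sketched roughly as follows. This is an illustration under stated assumptions, not HBase code: the function names, the cost and weight values, the per-function thresholds, and the linear recovery schedule are all hypothetical.

```python
def needs_balance_weighted(costs, weights, min_cost_need_balance):
    """Single trigger: weighted average of all costs against one threshold."""
    total_weight = sum(weights[name] for name in costs)
    weighted = sum(costs[name] * weights[name] for name in costs) / total_weight
    return weighted > min_cost_need_balance


def functions_out_of_balance(costs, thresholds):
    """Proposed trigger: list every cost function that exceeds its own threshold."""
    return [name for name, cost in costs.items() if cost > thresholds[name]]


def annealed_threshold(base, floor, run_index, period, ramp_runs):
    """Assumed schedule: drop to `floor` every `period` runs, then ramp
    linearly back up to `base` over `ramp_runs` runs."""
    phase = run_index % period
    if phase >= ramp_runs:
        return base
    return floor + (base - floor) * phase / ramp_runs


# Made-up costs in [0, 1]: table skew is bad, everything else is fine.
costs = {"regionCount": 0.02, "tableSkew": 0.30, "locality": 0.01}
weights = {"regionCount": 500, "tableSkew": 35, "locality": 25}
thresholds = {"regionCount": 0.05, "tableSkew": 0.05, "locality": 0.05}

# The heavily weighted regionCount cost dilutes the table skew signal.
print(needs_balance_weighted(costs, weights, 0.05))
# With per-function thresholds the skewed factor triggers on its own.
print(functions_out_of_balance(costs, thresholds))
# Threshold over the first few runs of one annealing period.
print([annealed_threshold(0.05, 0.01, i, 10, 4) for i in range(6)])
```

With these example numbers the weighted trigger stays below the threshold even though table skew is far out of balance, which is the situation the per-function proposal is meant to avoid.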

Issue Links

is a parent of

HBASE-26308 Sum of multiplier of cost functions is not populated properly when we have a shortcut for trigger

Resolved

HBASE-26310 Repro balancer behavior during iterations

Resolved

HBASE-26237 Improve computation complexity for primaryRegionCountSkewCostFunctio

Resolved

HBASE-26177 Add support to run balancer overriding current config

Open

HBASE-27302 Adding a trigger for Stochastic Balancer to safeguard for upper bound outliers

Open

HBASE-26178 Improve data structure and algorithm for BalanceClusterState to improve computation speed for large cluster

Resolved

HBASE-26297 Balancer run is improperly triggered by accuracy error of double comparison

Resolved

HBASE-26311 Balancer gets stuck in cohosted replica distribution

Resolved

HBASE-24643 Replace Cluster#primariesOfRegionsPerServer from int array to treemap

Open

HBASE-25625 StochasticBalancer CostFunctions needs a better way to evaluate region count distribution

Open

HBASE-25666 Explain why balancer is skipping runs

Resolved

HBASE-26309 Balancer tends to move regions to the server at the end of list

Resolved

HBASE-26337 Optimization for weighted random generators

Resolved

is related to

HBASE-24528 Improve balancer decision observability

Resolved

HBASE-25666 Explain why balancer is skipping runs

Resolved

HBASE-26147 Add dry run mode to hbase balancer

Resolved

People

Assignee: Unassigned
Reporter: Clara Xiong
Votes: 0
Watchers: 11

Dates

Created: 25/Mar/21 20:10
Updated: 14/Aug/22 04:03
