Motivation

The number of leader replicas per tablet server can become imbalanced over time, which leads to load skew on some nodes.

There are two causes of load skew:

The main cause is scans. Scan requests have two replica selection modes: LEADER_ONLY (the default) and CLOSEST_REPLICA. Users who need the most accurate results choose LEADER_ONLY, so in most cases the scan load on a tablet server is positively correlated with the number of leaders it hosts.

The other cause is writes. Leaders receive the write requests, while followers receive AppendEntries calls (UpdateConsensus in Kudu). The two processing paths differ slightly, which is a hidden variable that may cause imbalanced load. Leader rebalancing evens out leaders and followers across tablet servers, eliminates this hidden variable, and makes the service more stable.

To deal with this today, users can write a script around the kudu CLI leader_step_down command to rebalance the leaders, and SREs have to run that script periodically.

In our deployment we have more than 1500 Kudu clusters, and more are being deployed, so it is hard for SREs to maintain the rebalance script tasks.

Kudu has automatic replica rebalancing but no automatic leader rebalancing.

We can do better: the leader kudu-master can rebalance leaders automatically.

Solution

We can add an automatic leader rebalance task to avoid leader replica skew. The task runs periodically on kudu-master.

Leader rebalancing only transfers leadership; it does not copy replicas. The basic idea of the algorithm is that, on every tserver, number of leaders : number of followers = 1 : (replica_factor - 1).

If leader rebalancing is needed, the replica rebalancer should be enabled as well. With the leader rebalancer enabled but the auto rebalancer disabled, the algorithm still works, but the result is less balanced. The algorithm converges, and its target is that among the replicas on every tserver, number of leaders : number of followers is 1 : (replica_factor - 1).
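
The target ratio above can be illustrated with a minimal sketch. This is not the code in kudu-master; it is a simple greedy planner written for this description, and the function name `plan_leader_transfers` and the dictionary layout of `tablets` are hypothetical. It assumes each tserver should hold at most ceil(replicas / replica_factor) leaders and only moves leadership to a follower of the same tablet, so no replica is copied.

```python
from collections import Counter


def plan_leader_transfers(tablets, replica_factor):
    """Plan leader transfers so that each tserver holds roughly
    replicas / replica_factor leaders (hypothetical helper, not Kudu code).

    tablets maps tablet_id -> {"leader": ts, "replicas": [ts, ...]}.
    Returns (moves, leader_count) where moves is a list of
    (tablet_id, from_ts, to_ts) tuples.
    """
    replica_count = Counter()
    leader_count = Counter()
    for info in tablets.values():
        replica_count.update(info["replicas"])
        leader_count[info["leader"]] += 1

    # Upper bound on leaders per tserver: ceil(replicas / replica_factor).
    target = {ts: -(-n // replica_factor) for ts, n in replica_count.items()}

    moves = []
    for tablet_id in sorted(tablets):
        info = tablets[tablet_id]
        src = info["leader"]
        if leader_count[src] <= target[src]:
            continue  # this tserver is not over its leader target
        # Only followers of this tablet that are still below target qualify.
        candidates = [
            ts for ts in info["replicas"]
            if ts != src and leader_count[ts] < target[ts]
        ]
        if not candidates:
            continue
        # Prefer the follower's tserver with the fewest leaders so far.
        dst = min(candidates, key=lambda ts: (leader_count[ts], ts))
        leader_count[src] -= 1
        leader_count[dst] += 1
        moves.append((tablet_id, src, dst))
    return moves, dict(leader_count)


# Example: 40 tablets, replica_factor 3, all leaders initially on ts1.
servers = ["ts1", "ts2", "ts3"]
tablets = {
    f"tablet-{i:02d}": {"leader": "ts1", "replicas": list(servers)}
    for i in range(40)
}
moves, leaders = plan_leader_transfers(tablets, replica_factor=3)
print(len(moves), leaders)
```

With this input the sketch plans 26 transfers and ends with a 14 : 13 : 13 leader split, which is as even as 40 leaders can be spread over 3 tservers. When replicas themselves are unevenly placed, the per-tserver targets differ, which is why the replica rebalancer should run as well.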

Leader Rebalance results

I ran some experiments to measure the effect. The test cluster has 3 machines, running 3 master instances and 3 tserver instances.

I created a table with 40 tablets (partitions) and a replica_factor of 3, and loaded 40000000 records into it.

I disabled the leader rebalance function, manually transferred the leaders of all tablets to one tserver, and ran writes and scans.

Then I enabled the leader rebalance function and ran the scans again. The workload was as follows:

The Scan command: ./kudu_tools/kudu perf table_scan $master_list Student -columns=id,name,brief,age,score -num_threads=4 -nofill_cache -replica_selection="LEADER"

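In the table below, per-node values are listed in the order node1 : node2 : node3. For example, a leader ratio of 40 : 0 : 0 means node1 holds 40 leaders and the other two nodes hold none, and a CPU usage of 47%, 18%, 19% gives the values for node1, node2, and node3 respectively.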

	leader ratio	scan time	CPU usage	memory	disk I/O (throughput, ioutil)
before leader rebalance	40 : 0 : 0	811.586 s	47%, 18%, 19%	no change	102MB/s ioutil:55%, 8KB/s ioutil:2%, 64KB/s ioutil:3%
after leader rebalance	13 : 14 : 13	611.012 s	39%, 45%, 35%	no change	53MB/s ioutil:31%, 80MB/s ioutil:18%, 45MB/s ioutil:24%
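
As a quick check of the numbers in the table, the short calculation below derives the relative scan-time reduction and the spread between the busiest and least busy node's CPU usage. The variable names are illustrative; the values are copied from the table.

```python
# Scan times from the table, in seconds.
before_s = 811.586
after_s = 611.012

reduction = (before_s - after_s) / before_s  # fraction of scan time saved
speedup = before_s / after_s

# CPU usage per node (node1, node2, node3), in percent.
cpu_before = [47, 18, 19]
cpu_after = [39, 45, 35]
spread_before = max(cpu_before) - min(cpu_before)
spread_after = max(cpu_after) - min(cpu_after)

print(f"scan time reduction: {reduction:.1%}")   # 24.7%
print(f"speedup: {speedup:.2f}x")                # 1.33x
print(f"CPU spread: {spread_before} -> {spread_after} percentage points")
```

In this experiment the scan finished about 24.7% faster after leader rebalancing, and the gap between the most and least loaded node's CPU usage shrank from 29 to 10 percentage points.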

Issue Links

is related to

KUDU-3061 Balance tablet leaders across TServers

Open

People

Assignee: Yuqi Du

Reporter: Yuqi Du

Dates

Due: 25/Aug/22

Created: 11/Aug/22 08:40

Updated: 26/Oct/23 04:51

Resolved: 28/Mar/23 04:00