Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 1.14.0
Description
The function CheckCompleteReplace in the replace-based rebalancing tries to make the leader step down if the replica that should be removed is the leader. However, this can get stuck for a while if the replication factor of the table is 1, since there is no other voter to transfer leadership to.
So the fix is to make sure the number of voters in the tablet is greater than 1 before sending the LeaderStepDown request.
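The guard described above can be sketched roughly as follows. This is only an illustration under assumptions: the types Peer and TabletConfig and the helpers CountVoters and ShouldRequestLeaderStepDown are hypothetical stand-ins, not Kudu's actual classes or the code of the real fix.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a tablet's Raft configuration.
// These are NOT Kudu's real types.
enum class MemberType { kVoter, kNonVoter };

struct Peer {
  std::string uuid;
  MemberType type;
};

struct TabletConfig {
  std::vector<Peer> peers;
  std::string leader_uuid;
};

// Counts the voter replicas in the given configuration.
int CountVoters(const TabletConfig& config) {
  int num_voters = 0;
  for (const auto& peer : config.peers) {
    if (peer.type == MemberType::kVoter) {
      ++num_voters;
    }
  }
  return num_voters;
}

// Returns true only if the replica to be removed is the current leader
// AND there is at least one other voter that could take over leadership.
// With a single voter (replication factor 1), no step-down is requested.
bool ShouldRequestLeaderStepDown(const TabletConfig& config,
                                 const std::string& replica_to_remove) {
  if (config.leader_uuid != replica_to_remove) {
    return false;
  }
  return CountVoters(config) > 1;
}
```

With a check like this, a tablet whose only voter is the replica being moved would skip the step-down request, so it would not enter the leader-transfer state before a new peer has been added.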
Here's an example:
I executed the following commands to move all the tablets off a tablet server.
kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
kudu cluster rebalance master1,master2,master3 -ignored_tservers f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers
The rebalancing then gets stuck on a certain tablet for a while. In this case it was stuck for more than 10 minutes.
The reason is that the tablet performs the leader step-down too early and stays in the leader_transfer_in_progress_ state. The master then tries to send a change config request to add a peer, but the tablet server refuses it because of the leader_transfer_in_progress_ state.