  1. Kudu
  2. KUDU-3487

Rebalancer: Balancing a tablet with replication factor 1 might get stuck because the leader steps down too early


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14.0
    • Fix Version/s: 1.17.0
    • Component/s: None
    • Labels: None

    Description

      The function CheckCompleteReplace in replace rebalancing tries to make the leader step down when the replica that should be removed is the leader. This can get stuck for a while if the replication factor of the table is 1, since there is no other voter to transfer leadership to.

      The fix is to make sure the number of voters in the tablet is greater than 1 before sending the LeaderStepDown request.
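
      A minimal sketch of such a guard follows. It is not the actual Kudu code: ReplicaInfo, CountVoters and ShouldRequestLeaderStepDown are hypothetical names used only to illustrate the "more than one voter" check described above.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of one replica in a tablet's Raft config.
// This is an illustration only, not a Kudu type.
struct ReplicaInfo {
  std::string uuid;
  bool is_voter;
  bool is_leader;
};

// Counts the voter replicas in the given config.
inline int CountVoters(const std::vector<ReplicaInfo>& replicas) {
  int num_voters = 0;
  for (const auto& r : replicas) {
    if (r.is_voter) {
      ++num_voters;
    }
  }
  return num_voters;
}

// Returns true only when the replica to remove is the leader AND there is
// at least one other voter that could take over leadership.
inline bool ShouldRequestLeaderStepDown(const std::vector<ReplicaInfo>& replicas,
                                        const std::string& uuid_to_remove) {
  if (CountVoters(replicas) <= 1) {
    // With a single voter there is nobody to transfer leadership to.
    return false;
  }
  for (const auto& r : replicas) {
    if (r.uuid == uuid_to_remove) {
      return r.is_leader;
    }
  }
  // The replica is not in the config, so there is nothing to step down.
  return false;
}
```

      With this check, a tablet whose only voter is the replica being moved would skip the step-down until the replacement replica has become a voter.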

      Here's an example:

      I ran the following commands to move all the tablet replicas off a tablet server.

      kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
      kudu cluster rebalance master1,master2,master3 -ignored_tservers f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers

      The rebalancer then gets stuck on a certain tablet for a while. In this case it had been stuck for more than 10 minutes.

      The reason is that the tablet's leader steps down too early and stays in the leader_transfer_in_progress_ state. The master then tries to send a change config request to add a peer, but the tablet server refuses it because of the leader_transfer_in_progress_ state.
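
      The ordering problem described above can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not Kudu's consensus implementation: ToyTabletLeader and its members are hypothetical, the transfer completes instantly when another voter exists, and timeouts and retries are not modeled.

```cpp
// Toy model of a tablet leader, for illustration only (not a Kudu class).
struct ToyTabletLeader {
  int num_voters = 1;
  bool leader_transfer_in_progress = false;

  // Models a step-down request: a leadership transfer begins, and in this
  // toy model it completes only if another voter exists to take over.
  void RequestStepDown() {
    leader_transfer_in_progress = true;
    if (num_voters > 1) {
      leader_transfer_in_progress = false;
    }
  }

  // Models a config change that adds a peer: it is refused while a
  // leadership transfer is still pending.
  bool TryAddPeer() {
    if (leader_transfer_in_progress) {
      return false;
    }
    ++num_voters;
    return true;
  }
};
```

      In this model, calling RequestStepDown() before TryAddPeer() on a single-voter tablet leaves the transfer pending, so the peer is refused. Adding the peer first lets the later step-down complete.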

          People

            Assignee: Unassigned
            Reporter: Song Jiacheng