  1. Kudu
  2. KUDU-2548

Rebalancer tool should be able to run even if there are permanently dead tablet servers

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1
    • Fix Version/s: 1.10.0
    • Component/s: None
    • Labels: None

    Description

      The rebalancer will bail as soon as it sees a down tablet server, including at the very start, before it has moved any replicas. There are a few reasons for this:

      1. Rebalancing shouldn't fight with re-replication. If a tablet server is down for a while, all its replicas will need to be re-replicated. Since rebalancing is greedy and can be interrupted or resumed anytime, it's better to exit, allow re-replication to occur, and then resume rebalancing.
      2. It's more complicated to balance correctly with a greedy algorithm if tablet servers can come and go, since each arrival or departure changes the balance state of the cluster. We allow tablet servers to join the cluster and will begin moving replicas to them, but if we allow tablet servers to go down, we also need to decide what to do if they come back. It's easier to defer this problem until rebalancing and re-replication are somewhat unified in the master.
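
      To illustrate why a greedy rebalancer can be interrupted and resumed at any time, here is a minimal sketch. It is not Kudu's actual algorithm: it assumes balance is measured only by per-server replica counts, and the names next_move, rebalance, and max_moves are hypothetical.

```python
from typing import Dict, Optional, Tuple


def next_move(replica_counts: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Pick one greedy move (source, destination), or None if balanced.

    Assumption: "balanced" means the most and least loaded servers
    differ by at most one replica.
    """
    if len(replica_counts) < 2:
        return None
    # Iterate in sorted order so ties are broken deterministically.
    names = sorted(replica_counts)
    src = max(names, key=lambda ts: replica_counts[ts])
    dst = min(names, key=lambda ts: replica_counts[ts])
    if replica_counts[src] - replica_counts[dst] <= 1:
        return None
    return src, dst


def rebalance(replica_counts: Dict[str, int],
              max_moves: Optional[int] = None) -> Tuple[Dict[str, int], int]:
    """Apply greedy moves until balanced or until max_moves is reached.

    Each move depends only on the current counts, so stopping early
    leaves a valid state from which a later run can continue.
    """
    counts = dict(replica_counts)
    moves = 0
    while max_moves is None or moves < max_moves:
        move = next_move(counts)
        if move is None:
            break
        src, dst = move
        counts[src] -= 1
        counts[dst] += 1
        moves += 1
    return counts, moves
```

      Because every step looks only at the current counts, a run stopped after one move and resumed later ends in the same state as an uninterrupted run. The same property is what makes a disappearing or reappearing tablet server awkward: it changes the counts underneath the algorithm.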

      Nevertheless, it's a bummer that if, e.g., a user decommissioned a tserver 3 months ago, the rebalancer won't run because the rebalancer's ksck reports that a tserver is unavailable. We can fix this very cleanly once proper decommissioning has been implemented: with a distinction between "gone missing" and "decommissioned", we can have the rebalancer tool (really ksck) ignore decommissioned servers.
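
      The proposed distinction could look roughly like the following pre-flight check. This is only a sketch under assumptions: the TServerState enum and the functions blocking_tservers and can_rebalance are hypothetical and do not correspond to any existing Kudu or ksck API.

```python
from enum import Enum
from typing import Dict, List


class TServerState(Enum):
    """Hypothetical tablet server states for illustration only."""
    HEALTHY = "healthy"
    UNAVAILABLE = "unavailable"        # "gone missing": may come back
    DECOMMISSIONED = "decommissioned"  # intentionally removed for good


def blocking_tservers(states: Dict[str, TServerState]) -> List[str]:
    """Return the servers that should stop the rebalancer from running.

    Only servers that have gone missing block rebalancing;
    decommissioned servers are ignored.
    """
    return sorted(ts for ts, state in states.items()
                  if state is TServerState.UNAVAILABLE)


def can_rebalance(states: Dict[str, TServerState]) -> bool:
    """True if no tablet server is unexpectedly unavailable."""
    return not blocking_tservers(states)
```

      With this check, a cluster whose only non-healthy server was decommissioned months ago passes, while a server that has merely gone missing still causes the rebalancer to exit and leave room for re-replication.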

          People

            Assignee: wdberkeley William Berkeley
            Reporter: wdberkeley William Berkeley