Kudu / KUDU-2950

Support restarting nodes in batches

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ops-tooling
    • Labels: None

    Description

      Once Kudu has the building blocks to orchestrate a rolling restart, it'd be great if we could support restarting multiple nodes at a time.

      Location awareness would play a crucial role here: if locations are used to identify rack placement, we could bring down an entire rack at a time if we wanted. During the controlled restart of a given rack, though, Kudu would be more vulnerable to the unexpected downtime of another rack.
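To make the trade-off concrete, here is a minimal sketch of rack-sized restart batches. It is not Kudu tooling: the server names, the "/rackN" location strings, and the helper functions are all hypothetical, and it assumes a tablet with three replicas, one per rack, that needs a majority of replicas up to stay available.

```python
from collections import defaultdict


def batches_by_location(server_locations):
    """Group servers into restart batches, one batch per location (rack)."""
    batches = defaultdict(list)
    for server, location in sorted(server_locations.items()):
        batches[location].append(server)
    return dict(batches)


def live_replicas(replica_servers, down_servers):
    """Return the replicas hosted on servers that are still up."""
    return [s for s in replica_servers if s not in down_servers]


def has_majority(live, total):
    """True if a strict majority of the tablet's replicas is still up."""
    return len(live) > total // 2


# Hypothetical cluster: three racks with two tablet servers each.
server_locations = {
    "ts1": "/rack1", "ts2": "/rack1",
    "ts3": "/rack2", "ts4": "/rack2",
    "ts5": "/rack3", "ts6": "/rack3",
}
# Hypothetical tablet with one replica on each rack.
tablet_replicas = ["ts1", "ts3", "ts5"]

batches = batches_by_location(server_locations)

# Controlled restart of /rack1: one replica is down, a majority remains.
restarting = set(batches["/rack1"])
after_restart = live_replicas(tablet_replicas, restarting)

# /rack2 fails unexpectedly while /rack1 is still restarting.
down = restarting | set(batches["/rack2"])
after_failure = live_replicas(tablet_replicas, down)

print(has_majority(after_restart, len(tablet_replicas)))  # rack restart alone
print(has_majority(after_failure, len(tablet_replicas)))  # plus a rack failure
```

In this toy layout the tablet keeps its majority while only /rack1 is restarting, and loses it once a second rack goes down during the restart window.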

      One approach would be to support something like HDFS's upgrade domains:

      The idea is to group datanodes in a new dimension called upgrade domain, in addition to the existing rack-based grouping. For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01, nodes in the second position to upgrade domain ud_02 and so on.
      ...
      By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to a specific upgrade domain collectively won’t store more than one replica of any block.

      The decoupling of physical groups from restartable groups should make batch restarts more robust to rack failures.
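The grouping described in the HDFS excerpt above can be sketched as follows. This is an illustration only, not an existing Kudu or HDFS API: the rack layout, server names, and function names are assumptions, and the only property it checks is the one quoted above, namely that no upgrade domain holds more than one replica of a given tablet.

```python
from collections import defaultdict


def assign_upgrade_domains(racks):
    """Map each node to an upgrade domain based on its position in its rack."""
    domains = {}
    for nodes in racks.values():
        for position, node in enumerate(nodes, start=1):
            domains[node] = f"ud_{position:02d}"
    return domains


def max_replicas_per_domain(replica_nodes, domains):
    """Return the largest number of replicas that share one upgrade domain."""
    counts = defaultdict(int)
    for node in replica_nodes:
        counts[domains[node]] += 1
    return max(counts.values())


# Hypothetical cluster: three racks with three tablet servers each.
racks = {
    "/rack1": ["ts1", "ts2", "ts3"],
    "/rack2": ["ts4", "ts5", "ts6"],
    "/rack3": ["ts7", "ts8", "ts9"],
}
domains = assign_upgrade_domains(racks)

# Hypothetical tablet placed on three different racks and upgrade domains.
tablet_replicas = ["ts1", "ts5", "ts9"]

# A restart batch is every node in one upgrade domain, spanning all racks.
restart_batch = sorted(n for n, d in domains.items() if d == "ud_01")

# Replicas of the tablet taken offline by restarting that batch.
offline = [n for n in tablet_replicas if n in restart_batch]

print(restart_batch)
print(max_replicas_per_domain(tablet_replicas, domains))
print(offline)
```

Because the batch cuts across racks instead of following them, restarting one upgrade domain takes at most one replica of this tablet offline, while each rack still has nodes serving.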

            People

              Assignee: Unassigned
              Reporter: Andrew Wong (awong)
              Votes: 0
              Watchers: 2