Kafka / KAFKA-1300

Added WaitForReplication admin tool.

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.9.0
    • Component/s: tools
    • Labels:
    • Environment:
      Ubuntu 12.04

      Description

      I have created a tool similar to the broker shutdown tool for doing rolling restarts of Kafka clusters.

      The tool watches the max replica lag of the specified broker, and waits until the lag drops to 0 before exiting.

      To do a rolling restart, here's the process we use:

      for (broker <- brokers) {
        run shutdown tool for broker
        terminate broker
        start new broker
        run wait for replication tool on new broker
      }
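The loop above could be sketched in Python roughly as follows. This is only an illustration of the ordering of the steps, not code from the patch: the four callables (`shutdown`, `terminate`, `start`, `wait_for_replication`) are hypothetical stand-ins for whatever commands an operator actually runs, and stopping at the first broker that fails to catch up is an assumption.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    brokers: Iterable[int],
    shutdown: Callable[[int], None],
    terminate: Callable[[int], None],
    start: Callable[[int], None],
    wait_for_replication: Callable[[int], bool],
) -> List[int]:
    """Restart brokers one at a time, in the order described in the issue.

    All four callables are hypothetical hooks supplied by the caller.
    Returns the list of broker ids that were restarted successfully.
    """
    restarted: List[int] = []
    for broker in brokers:
        shutdown(broker)    # run the shutdown tool for this broker
        terminate(broker)   # terminate the broker process
        start(broker)       # start the new broker
        # Assumption: abort the rollout if the broker never catches up,
        # so the next broker is not taken down while this one is lagging.
        if not wait_for_replication(broker):
            raise RuntimeError(f"broker {broker} did not reach zero replica lag")
        restarted.append(broker)
    return restarted
```

Passing the steps in as callables keeps the sketch independent of how each step is really performed (shell command, API call, orchestration system).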

      Here's an example command-line invocation:

      ./kafka-run-class.sh kafka.admin.WaitForReplication --zookeeper zk.host.com:2181 --num.retries 100 --retry.interval.ms 60000 --broker 0
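The waiting behaviour described above can be sketched as a simple polling loop. This is a minimal Python illustration, not the tool's actual implementation: `get_max_lag` is a hypothetical callable that returns the broker's current max replica lag, and the reading of `--num.retries` as "number of checks" and `--retry.interval.ms` as "pause between checks" is an assumption based on the flag names.

```python
import time
from typing import Callable


def wait_for_replication(
    get_max_lag: Callable[[], int],
    num_retries: int = 100,
    retry_interval_ms: int = 60000,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the max replica lag until it reaches 0 or the retries run out.

    get_max_lag is a hypothetical hook; how the lag is obtained is not
    specified here. Returns True once the lag is 0, False otherwise.
    """
    for attempt in range(num_retries):
        if get_max_lag() == 0:
            return True
        # Pause between checks, but not after the final check.
        if attempt < num_retries - 1:
            sleep(retry_interval_ms / 1000.0)
    return False
```

The `sleep` parameter exists only so the loop can be exercised without real delays; callers would normally leave it at its default.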


          Activity

          Brenden Matthews added a comment -

          Bump!

          Anyone interested in this? Presumably this would be valuable to others.

          Joel Koshy added a comment -

          Is this needed given that controlled shutdown is built into the broker? The retry counts and retry intervals are also configurable.

          Brenden Matthews added a comment -

          This tool is orthogonal to the controlled shutdown tool. This is to help ensure that, once a broker comes online, it is in a fully replicated state.

          Joel Koshy added a comment -

          Understood, but the primary use case would be to proceed to do a controlled
          shutdown of the next broker in the shutdown plan. However, with retries and
          a large enough retry interval that is not needed. (E.g., you can set a very
          large number of retries.)

          The documentation recommends closely monitoring under-replicated-partition
          counts across the cluster (and alerting if the count is anything other than
          zero). That is, ensuring brokers are in a fully replicated state is an
          operational best practice and should happen 24/7 (not just during bounces).

          Alexis Midon added a comment - edited

          Considering that Kafka is designed to handle some replication lag, if you need to shut down a broker it does not seem very useful to wait for the replica lag to be zero.
          (If the broker is X messages behind, and my maintenance requires Y=F(message throughput) minutes, I can safely shut down the broker if X+Y/throughput < replica.lag.max.messages.)

          So maybe that command would be more useful if it could take an argument that characterizes X, i.e. how far behind the broker can be before a shutdown.

          Guozhang Wang added a comment -

          Moving out of 0.8.2 for now.


            People

            • Assignee: Unassigned
            • Reporter: Brenden Matthews
            • Votes: 1
            • Watchers: 6
