Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.10.1.0
    • Component/s: tools
    • Labels:
    • Environment:
      Ubuntu 12.04

      Description

      I have created a tool similar to the broker shutdown tool for doing rolling restarts of Kafka clusters.

      The tool watches the max replica lag of the specified broker, and waits until the lag drops to 0 before exiting.

      To do a rolling restart, here's the process we use:

      for (broker <- brokers) {
        run shutdown tool for broker
        terminate broker
        start new broker
        run wait for replication tool on new broker
      }
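The loop above could be sketched in Python roughly as follows. This is an illustration only, not part of the patch: the four callables are hypothetical stand-ins for the shutdown tool, process termination, broker startup, and the wait-for-replication tool, and aborting the rollout when a broker fails to catch up is an assumption rather than something the description states.

```python
from typing import Callable, Iterable


def rolling_restart(
    brokers: Iterable[int],
    shutdown: Callable[[int], None],
    terminate: Callable[[int], None],
    start: Callable[[int], None],
    wait_for_replication: Callable[[int], bool],
) -> None:
    """Restart brokers one at a time, in the order given.

    All four callables are hypothetical hooks supplied by the caller.
    wait_for_replication should return True once the broker has caught up.
    """
    for broker in brokers:
        shutdown(broker)   # run shutdown tool for broker
        terminate(broker)  # terminate broker
        start(broker)      # start new broker
        # Assumption: stop the rollout rather than move on to the next
        # broker while this one is still behind.
        if not wait_for_replication(broker):
            raise RuntimeError(f"broker {broker} did not fully replicate")
```

Handling one broker at a time and blocking on the wait step means at most one broker is out of the replica set at any point in the rollout.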

      Here's an example command-line invocation:

      ./kafka-run-class.sh kafka.admin.WaitForReplication --zookeeper zk.host.com:2181 --num.retries 100 --retry.interval.ms 60000 --broker 0
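The `--num.retries` and `--retry.interval.ms` flags suggest a simple polling loop: read the broker's max replica lag, and if it is not yet 0, sleep and try again. A minimal Python sketch of that idea follows. It is not the tool's actual implementation: `get_max_lag` is a hypothetical callable standing in for however the lag is read, the `sleep` parameter exists only so the loop can be exercised without waiting, and treating the retry count as the total number of checks is an assumption.

```python
import time
from typing import Callable


def wait_for_replication(
    get_max_lag: Callable[[], int],
    num_retries: int = 100,
    retry_interval_ms: int = 60000,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the max replica lag reaches 0 or the checks run out.

    get_max_lag is a hypothetical stand-in for reading the broker's lag.
    num_retries is treated as the total number of checks (an assumption).
    Returns True if the lag reached 0, False otherwise.
    """
    for attempt in range(num_retries):
        if get_max_lag() == 0:
            return True
        # No point sleeping after the final failed check.
        if attempt < num_retries - 1:
            sleep(retry_interval_ms / 1000.0)
    return False
```

With the values from the command line above (100 retries, 60000 ms apart), a loop like this would give up after roughly 99 minutes of waiting.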

        Activity

        Guozhang Wang added a comment -

        Moving out of 0.8.2 for now.

        Alexis Midon added a comment - edited

        Considering that Kafka is designed to handle some replication lag, if you need to shut down a broker it does not seem very useful to wait for the replica lag to be zero.
        (If the broker is X messages behind, and my maintenance requires Y=F(message throughput) minutes, I can safely shut down the broker if X+Y/throughput < replica.lag.max.messages.)

        So maybe that command would be more useful if it could take an argument that characterizes X, i.e. how far behind the broker can be before a shutdown.

        Joel Koshy added a comment -

        Understood, but the primary use case would be to proceed to do a controlled
        shutdown of the next broker in the shutdown plan. However, with retries and
        a large enough retry interval that is not needed. (E.g., you can set a very
        large number of retries.)

        The documentation recommends closely monitoring under-replicated-partition
        counts across the cluster (and alerting if the count is anything other than
        zero). That is, ensuring brokers are in a fully replicated state is a
        "best practice" for operations and should be done 24/7 (not just during bounces).

        Brenden Matthews added a comment -

        This tool is orthogonal to the controlled shutdown tool. This is to help ensure that, once a broker comes online, it is in a fully replicated state.

        Joel Koshy added a comment -

        Is this needed, given that controlled shutdown is built into the broker? The retry counts and retry intervals are also configurable.

        Brenden Matthews added a comment -

        Bump!

        Anyone interested in this? Presumably this would be valuable to others.


          People

          • Assignee: Unassigned
          • Reporter: Brenden Matthews
          • Votes: 1
          • Watchers: 5
