Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.9.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Currently, rolling-bouncing a Kafka cluster with tens of thousands of partitions can take very long (~2 min per broker with ~5000 partitions/broker in our environment). The majority of the time is spent shutting down the brokers. The time to shut down a broker usually consists of the following parts:

      T1: During a rolling bounce, people usually want to make sure there are no under-replicated partitions before taking down the next broker, so shutting down a broker has to wait for the previously restarted broker to catch up. This is T1.

      T2: The time to send the controlled shutdown request and receive the controlled shutdown response. Currently a controlled shutdown request triggers many LeaderAndIsrRequests and UpdateMetadataRequests, and also involves many ZooKeeper updates performed serially.

      T3: The actual time to shut down all the components. It is usually small compared with T1 and T2.

      T1 is related to:
      A) the inbound throughput on the cluster, and
      B) the "down" time of the broker (the time between the replica fetchers stopping and restarting)
      The higher the traffic, or the longer the broker stops fetching, the longer it takes for the broker to catch up and get back into the ISR, and therefore the longer T1 will be. Assume:

      • the inbound network traffic on a broker is X bytes/second
      • the time T1.B ("down" time) mentioned above is T
        Theoretically it will take (X * T) / (NetworkBandwidth - X) = InboundNetworkUtilization * T / (1 - InboundNetworkUtilization) for the broker to catch up after the restart. While X is out of our control, T is largely determined by T2.
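
        As a rough worked example with hypothetical numbers: on a 1 GB/s link with X = 500 MB/s of inbound traffic (50% utilization), a down time of T = 120 s leaves a backlog of X * T = 60 GB, and catching up at the spare capacity of 500 MB/s takes 0.5 * 120 / (1 - 0.5) = 120 s. At 80% utilization the same 120 s of down time takes about 0.8 * 120 / 0.2 = 480 s to recover, which is why shrinking T (and hence T2) matters more on busier clusters.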

      The purpose of this ticket is to reduce T2 by:
      1. Batching the LeaderAndIsrRequests and UpdateMetadataRequests sent during controlled shutdown.
      2. Using async ZooKeeper writes to pipeline the ZooKeeper updates (see the sketch below). According to the ZooKeeper wiki (https://wiki.apache.org/hadoop/ZooKeeper/Performance), a 3-node ZK cluster should be able to handle ~20K writes/second (1K size). So if we use async writes, we should be able to reduce the ZooKeeper update time to a few seconds or even sub-second.
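
      To illustrate item 2, here is a minimal sketch (not the actual patch; item 1 touches Kafka controller internals and is not shown) of pipelining znode updates with ZooKeeper's async setData API instead of blocking on each write. The class and method names are placeholders, error handling and conditional-version checks are omitted, and it assumes the plain org.apache.zookeeper.ZooKeeper client:

          import java.util.List;
          import java.util.concurrent.CountDownLatch;
          import org.apache.zookeeper.AsyncCallback.StatCallback;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class PipelinedZkWrites {
              // Fire all znode updates without waiting for individual responses,
              // then block once until every callback has come back.
              public static void updateAll(ZooKeeper zk, List<String> paths, List<byte[]> payloads)
                      throws InterruptedException {
                  CountDownLatch done = new CountDownLatch(paths.size());
                  StatCallback cb = (int rc, String path, Object ctx, Stat stat) -> {
                      // A real implementation would check rc and retry or fail on errors.
                      done.countDown();
                  };
                  for (int i = 0; i < paths.size(); i++) {
                      // version -1 means "match any version"; the real ISR updates are conditional.
                      zk.setData(paths.get(i), payloads.get(i), -1, cb, null);
                  }
                  done.await(); // overall latency is roughly one round trip plus server-side pipelining,
                                // rather than N sequential round trips with synchronous writes
              }
          }

      With thousands of partitions per broker, issuing the leader-and-ISR znode updates this way is what makes the "sub-second" ZooKeeper update time in item 2 plausible.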

        Issue Links

          Activity

          junrao Jun Rao added a comment -

          This is now fixed in KAFKA-5642.

          guozhang Guozhang Wang added a comment -

          Reminder to the contributor / reviewer of the PR: please note that the code deadline for 1.0.0 is less than 2 weeks away (Oct. 4th). Please re-evaluate your JIRA and see whether it still makes sense to merge it into 1.0.0, whether it should be pushed out to 1.1.0, or whether it should be closed if the JIRA is no longer valid; also re-assign the contributor / committer if you are no longer working on the JIRA.

          stephane.maarek@gmail.com Stephane Maarek added a comment -

          Onur Karaman I can't find the JIRA / PR regarding your re-write of the controller. Would that address https://issues.apache.org/jira/browse/KAFKA-3083 too?

          becket_qin Jiangjie Qin added a comment -

          Sorry, you are right. I was under the impression that the patch was merged... The latest code does not have the optimization...

          I don't think Onur has a public doc available. As a likely-outdated reference you can take a look at this:
          https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Controller+Redesign

          wushujames James Cheng added a comment -

          I haven't had a chance to try trunk. Is it on trunk but not in 0.10.2?

          Can you point me to the PRs that did batching of the partitions? I found https://issues.apache.org/jira/browse/KAFKA-4444, but I don't see any indication that it was merged into the code base.

          Also, are there any docs on Onur Karaman's rewrite of the controller? I'd like to read them.

          becket_qin Jiangjie Qin added a comment -

          Onur Karaman is currently working on rewriting the controller. The latest trunk already has some controlled shutdown performance improvement by batching the partitions. Have you had a chance to try it?

          wushujames James Cheng added a comment -

          Is there a chance this can be worked on for 0.10.3? We have a cluster with 10,000 partitions per broker. It regularly takes around 8 minutes to shut down a broker.

          becket_qin Jiangjie Qin added a comment -

          Thanks Ismael. I took a brief look at the KAFKA-3028 patch. It seems to use a different approach. I also took a look at KAFKA-3083. I think the approach in this patch might be able to address the concerns in both tickets. I will update the patch to address KAFKA-3083 and see how it goes.

          ijuma Ismael Juma added a comment -

          Jiangjie Qin, with regards to async ZK, Eno Thereska provided a PR for KAFKA-3038 some time ago that took advantage of ZK's async API. Jun Rao was concerned about having two different ways of using the ZK API. I haven't checked your PR yet, but I thought I'd point this out so that you are aware.

          githubbot ASF GitHub Bot added a comment -

          GitHub user becketqin opened a pull request:

          https://github.com/apache/kafka/pull/1149

          KAFKA-3436: Speed up controlled shutdown.

          This patch does the following:
          1. Batched LeaderAndIsrRequest and UpdateMetadataRequest during controlled shutdown.
          2. Added async read and write methods to an extended ZkClient. Used the async ZK operations for the LeaderAndIsr read and update. The async methods can be used in other places as well (e.g. preferred leader election, replica reassignment, controller bootstrap, etc.), but those are out of the scope of this ticket.

          Conducted some rolling bounce tests: a controlled shutdown involving 2500 partitions now takes around 3 seconds. Previously it could take more than 30 seconds.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/becketqin/kafka KAFKA-3436

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/kafka/pull/1149.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #1149


          commit c2d22821c6c3ad7aa45090def6b984719209f5af
          Author: Jiangjie Qin <becket.qin@gmail.com>
          Date: 2016-03-27T21:29:30Z

          KAFKA-3436: Speed up controlled shutdown

          commit 7e7cf3fb1fc4a44d7af4ea935b38bf2e90e6cadd
          Author: Jiangjie Qin <becket.qin@gmail.com>
          Date: 2016-03-28T00:47:22Z

          Remove pre-sent StopReplicaRequests and split state transition into multiple groups.



            People

            • Assignee:
              becket_qin Jiangjie Qin
              Reporter:
              becket_qin Jiangjie Qin
              Reviewer:
              Joel Koshy
            • Votes:
              2
              Watchers:
              12
