KAFKA-1030: Addition of partitions requires bouncing all the consumers of that topic

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

      Description

      Consumers may not notice new partitions because the propagation of metadata to the servers can be delayed.

      Options:
      1. As Jun suggested on KAFKA-956, the easiest fix would be to read the new partition data from ZooKeeper instead of a Kafka server.
      2. Run a metadata fetch loop in the consumer, and set auto.offset.reset to smallest once the consumer has started.

      Option 1 sounds easier to do. If option 1 causes long delays in reading all partitions at the start of every rebalance, option 2 may be worth considering (a configuration sketch for option 2 follows at the end of this description).

      The same issue affects MirrorMaker when new topics are created: MirrorMaker may not notice all partitions of the new topics until the next rebalance.
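
      For reference, here is a minimal sketch of the configuration side of option 2, assuming the 0.8 high-level consumer API; the ZooKeeper address and group id are placeholders, and the periodic metadata-fetch loop itself is omitted. Setting auto.offset.reset to smallest means a partition discovered late (with no committed offset yet) is consumed from its beginning rather than only from the log end.

      {code:scala}
      import java.util.Properties
      import kafka.consumer.{Consumer, ConsumerConfig}

      object Option2ConfigSketch {
        def main(args: Array[String]): Unit = {
          val props = new Properties()
          props.put("zookeeper.connect", "localhost:2181") // placeholder ZK address
          props.put("group.id", "example-group")           // placeholder group id
          // With "smallest", a newly discovered partition with no committed offset
          // is read from offset 0 rather than from the log end.
          props.put("auto.offset.reset", "smallest")
          val connector = Consumer.create(new ConsumerConfig(props))
          // ... create message streams here, and periodically re-fetch topic metadata ...
          connector.shutdown()
        }
      }
      {code}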

        Activity

        Swapnil Ghike added a comment -

        As Jun suggested on KAFKA-956, the easiest fix would be to read the new partition data from zookeeper instead of a kafka server.

        Neha Narkhede added a comment -

        One alternative is to change the topic metadata request handling on the controller broker and let the consumer re-issue a metadata request to the controller broker when the partition change listener fires. If the broker is the controller, it should serve the metadata request by reading from the controller cache directly. If not, it can rely on the leaderCache that is updated via the UpdateMetadata request. The upside of this approach is that it won't kill performance and will solve the problem. The downside is that it might make the metadata request handling on the controller broker somewhat slower, since it involves locking on the controller lock.
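
        As an illustrative, self-contained model of this routing idea (not Kafka's actual code; all names here are hypothetical): the controller answers metadata requests from its own cache under the controller lock, while every other broker answers from a leaderCache kept current by UpdateMetadata requests.

        {code:scala}
        object MetadataRoutingSketch {
          case class TopicMetadata(topic: String, partitionLeaders: Map[Int, Int])

          class Broker(val brokerId: Int, initialControllerId: Int) {
            @volatile var controllerId: Int = initialControllerId
            private val controllerLock = new Object
            // Authoritative view, maintained only while this broker is the controller.
            private var controllerCache = Map.empty[String, TopicMetadata]
            // Per-broker view, refreshed by UpdateMetadata requests from the controller.
            @volatile private var leaderCache = Map.empty[String, TopicMetadata]

            def handleMetadataRequest(topic: String): Option[TopicMetadata] =
              if (brokerId == controllerId)
                // Locking here is the extra cost on the controller path noted above.
                controllerLock.synchronized { controllerCache.get(topic) }
              else
                leaderCache.get(topic)

            // Called when an UpdateMetadata request arrives from the controller.
            def onUpdateMetadata(md: TopicMetadata): Unit =
              leaderCache = leaderCache + (md.topic -> md)

            // Called on the controller when its partition-change listener fires.
            def onPartitionChange(md: TopicMetadata): Unit =
              controllerLock.synchronized { controllerCache = controllerCache + (md.topic -> md) }
          }

          def main(args: Array[String]): Unit = {
            val controller = new Broker(brokerId = 1, initialControllerId = 1)
            controller.onPartitionChange(TopicMetadata("t", Map(0 -> 1, 1 -> 2)))
            println(controller.handleMetadataRequest("t")) // served from the controller cache
          }
        }
        {code}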

        Swapnil Ghike added a comment -

        Hmm, this will mean that the consumer client will cease to be controller-agnostic. Is that a good idea? Plus, if there is a controller failover at the same time as a consumer trying to fetch metadata, the broker the consumer was talking to for fetching metadata may have stale metadata. So, we may need to implement a controller failover watcher on the consumer to trigger fetching metadata. Thoughts?

        Neha Narkhede added a comment -

        Longer term, the right fix will be to move rebalancing to the controller and let it coordinate state changes for the consumer. Until then, it will be some sort of workaround to get the latest state changes. To your point, if a controller failover is happening, the rebalance attempt will fail. This is no different from leadership changes. I don't see why we need a controller failover watch. This controller metadata is only required when partitions change, so it is fetched on demand.

        Guozhang Wang added a comment -

        I think the first option might work just fine, since the consumer does not actually need the partition leader id, etc. All it needs is the map of topic -> list of partition ids. This can be done by reading just one ZK path per topic: /brokers/topics/[topic]. Of course this will put more pressure on MirrorMaker, but we should really not do a full rebalance for an added partition or topic anyway.
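
        A minimal sketch of that single-path read, assuming the 0.8 znode layout in which /brokers/topics/[topic] holds JSON of the form {"version":1,"partitions":{"0":[1,2],"1":[2,3]}} (partition id mapped to its replica list); the object name, connect string, and error handling are illustrative.

        {code:scala}
        import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
        import scala.util.parsing.json.JSON // in the standard library for Scala 2.9/2.10

        object TopicPartitionsFromZk {
          // Returns the sorted partition ids for a topic with a single getData call;
          // leader ids, ISRs, etc. are deliberately ignored, as noted above.
          def partitionIds(zk: ZooKeeper, topic: String): Seq[Int] = {
            val raw = new String(zk.getData("/brokers/topics/" + topic, false, null), "UTF-8")
            JSON.parseFull(raw) match {
              case Some(node: Map[String, Any] @unchecked) =>
                node.get("partitions") match {
                  case Some(parts: Map[String, Any] @unchecked) =>
                    parts.keys.map(_.toInt).toSeq.sorted
                  case _ => Seq.empty
                }
              case _ => Seq.empty
            }
          }

          def main(args: Array[String]): Unit = {
            val noOpWatcher = new Watcher { def process(event: WatchedEvent): Unit = () }
            val zk = new ZooKeeper("localhost:2181", 30000, noOpWatcher) // placeholder address
            try println(partitionIds(zk, "my-topic"))
            finally zk.close()
          }
        }
        {code}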

        Guozhang Wang added a comment -

        Updated reviewboard https://reviews.apache.org/r/14041/

        Guozhang Wang added a comment -

        Here are the performance testing results:

        Setup: 1) 5 instances of mirror maker consuming from around 3800 topic/partitions; 2) 1 instance of console consumer consuming from around 300 topic/partitions.

        1) Bouncing the mirror makers:

        ZK located in the same DC: 4 minutes 20 seconds with the fix; 3 minutes 50 seconds without the fix
        ZK located in another DC: 8 minutes 2 seconds with the fix; 7 minutes 6 seconds without the fix

        2) Bouncing the console consumer:

        ZK located in the same DC: 15 seconds with the fix; 15 seconds without the fix

        Given these results, I think it is worth pushing this approach (read-from-ZK) into 0.8, and we can later pursue the other approach Joel proposed in the reviewboard on trunk.

        Swapnil Ghike added a comment -

        +1 on that, Guozhang. Thanks for running the tests.

        Neha Narkhede added a comment -

        Thanks for the updated patch and the performance comparison analysis. I agree that the ideal change might prove to be too large for 0.8 and will require a non-trivial amount of time to stabilize, since it is fairly tricky. We can just do it properly on trunk and live with this minor performance hit for consumer rebalancing in 0.8.

        Neha Narkhede added a comment -


        Checked in the latest patch to 0.8


          People

          • Assignee: Guozhang Wang
          • Reporter: Swapnil Ghike
          • Votes: 0
          • Watchers: 3
