KAFKA-849

Bug in controller's startup/failover logic fails to update the in-memory leader and ISR cache, causing other state changes to work incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: controller
    • Labels:

      Description

      partitionLeadershipInfo is the controller's in-memory cache that tracks every partition's "last elected" leader and ISR. On controller startup/failover, this cache is bootstrapped only with those partitions whose leader is alive. As a result, the leader and ISR cache is initialized incorrectly, which breaks other state transitions, such as those triggered by a new broker starting up or an existing broker failing. For instance, the controller cannot send a broker the full list of replicas that exist on it during startup.
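
The effect of the incomplete bootstrap can be illustrated with a small sketch. This is a toy model in Python, not the controller's actual Scala code: the names `bootstrap_buggy`, `bootstrap_fixed`, `replicas_to_send`, and the sample data are all hypothetical, and the model assumes that a replica is only reported to a broker when its partition is present in the cache.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Partition = Tuple[str, int]  # (topic, partition id)


@dataclass(frozen=True)
class LeaderAndIsr:
    leader: int
    isr: Tuple[int, ...]


def bootstrap_buggy(persisted: Dict[Partition, LeaderAndIsr],
                    live_brokers: Set[int]) -> Dict[Partition, LeaderAndIsr]:
    # Buggy behaviour described above: keep only partitions whose leader is alive.
    return {p: s for p, s in persisted.items() if s.leader in live_brokers}


def bootstrap_fixed(persisted: Dict[Partition, LeaderAndIsr],
                    live_brokers: Set[int]) -> Dict[Partition, LeaderAndIsr]:
    # Fixed behaviour: record the last decision for every partition,
    # whether or not its leader is currently alive.
    return dict(persisted)


def replicas_to_send(broker: int,
                     assignment: Dict[Partition, List[int]],
                     cache: Dict[Partition, LeaderAndIsr]) -> List[Partition]:
    # Assumption of this toy model: a replica is only sent to a broker
    # if its partition has an entry in the leader and ISR cache.
    return sorted(p for p, replicas in assignment.items()
                  if broker in replicas and p in cache)


# Hypothetical persisted state: broker 3 leads ("t", 1) but is currently down.
persisted = {
    ("t", 0): LeaderAndIsr(leader=1, isr=(1, 2)),
    ("t", 1): LeaderAndIsr(leader=3, isr=(3, 2)),
}
assignment = {("t", 0): [1, 2], ("t", 1): [3, 2]}
live = {1, 2}

buggy = bootstrap_buggy(persisted, live)
fixed = bootstrap_fixed(persisted, live)

print(replicas_to_send(2, assignment, buggy))  # [('t', 0)]
print(replicas_to_send(2, assignment, fixed))  # [('t', 0), ('t', 1)]
```

In this model, broker 2 hosts a replica of both partitions, but with the buggy bootstrap it is only told about the one whose leader is alive.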

      Another bug during controller startup is that it invokes the OnlinePartition state change before the OnlineReplica state change. This also breaks the guarantee that the controller sends a full list of replicas to a broker on startup.
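
Why the ordering matters can be sketched with another toy model in Python. Everything here is hypothetical (`Broker`, `online_replica_change`, `online_partition_change`, `first_request_seen_by`), and the model rests on two stated assumptions: the OnlineReplica change sends each broker every replica it hosts, while the OnlinePartition change only notifies the replicas of partitions for which it elects a leader.

```python
from typing import Dict, List, Set, Tuple

Partition = Tuple[str, int]  # (topic, partition id)


class Broker:
    def __init__(self, broker_id: int) -> None:
        self.broker_id = broker_id
        self.requests: List[List[Partition]] = []  # requests in arrival order

    def receive(self, partitions: List[Partition]) -> None:
        self.requests.append(sorted(partitions))


def online_replica_change(brokers: Dict[int, Broker],
                          assignment: Dict[Partition, List[int]]) -> None:
    # Assumption: each live broker is sent every replica it hosts.
    for broker in brokers.values():
        hosted = [p for p, rs in assignment.items() if broker.broker_id in rs]
        if hosted:
            broker.receive(hosted)


def online_partition_change(brokers: Dict[int, Broker],
                            assignment: Dict[Partition, List[int]],
                            needs_election: Set[Partition]) -> None:
    # Assumption: only replicas of partitions that get a new leader are notified.
    for p in sorted(needs_election):
        for broker_id in assignment[p]:
            if broker_id in brokers:
                brokers[broker_id].receive([p])


def first_request_seen_by(broker_id: int, replica_first: bool) -> List[Partition]:
    brokers = {1: Broker(1), 2: Broker(2)}           # broker 3 is down
    assignment = {("t", 0): [1, 2], ("t", 1): [2, 3]}
    needs_election = {("t", 1)}                      # its old leader is gone
    if replica_first:
        online_replica_change(brokers, assignment)
        online_partition_change(brokers, assignment, needs_election)
    else:
        online_partition_change(brokers, assignment, needs_election)
        online_replica_change(brokers, assignment)
    return brokers[broker_id].requests[0]


print(first_request_seen_by(2, replica_first=False))  # [('t', 1)]
print(first_request_seen_by(2, replica_first=True))   # [('t', 0), ('t', 1)]
```

Under these assumptions, running the OnlinePartition change first means the first request broker 2 sees covers only part of what it hosts; running the OnlineReplica change first gives it the full list.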

      1. kafka-849-v1.patch
        17 kB
        Neha Narkhede

        Activity

        Neha Narkhede created issue -
        Neha Narkhede made changes -
        Field Original Value New Value
        Status Open [ 1 ] In Progress [ 3 ]
        Neha Narkhede added a comment -

        Fixed the bug so that the leader and ISR cache is updated whether or not the leader is alive. This is the right thing to do, since the purpose of the cache is to record the last decision made. On controller failover, this is the value read from ZooKeeper.

        Other than that, fixed a couple of other issues:

        1. Changed the list topics tool to also print whether or not a partition is under-replicated. This makes it very easy to script the output of list topics to show only partitions that are under-replicated.
        2. Reduced the noise in the logs due to failed metadata requests. There is not much value in logging these, since when some brokers are down, the stack trace just complains that those brokers are down. We still return the correct error code to the client, so this error message is now logged at debug level.

        Neha Narkhede made changes -
        Attachment kafka-849-v1.patch [ 12577103 ]
        Neha Narkhede made changes -
        Attachment kafka-849-v1.patch [ 12577103 ]
        Neha Narkhede made changes -
        Attachment kafka-849-v1.patch [ 12577105 ]
        Jun Rao added a comment -

        Thanks for the patch. +1. The changes related to the list topics tool are not sufficient, though. The problem is that if a broker is down, AdminUtils.fetchTopicMetadataFromZk returns an empty replica list. There is a patch in KAFKA-850 that fixes this issue more completely.

        Swapnil Ghike added a comment -

        +1 Both were great catches!

        Neha Narkhede added a comment -

        Thanks, Jun and Swapnil, for the quick review! Jun, I agree that these changes are not complete. I saw that you have it covered in KAFKA-850, so I left it out of this patch.

        Neha Narkhede made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Neha Narkhede added a comment -

        Committed to 0.8

        Neha Narkhede made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Neha Narkhede made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Tony Stevenson made changes -
        Workflow no-reopen-closed, patch-avail [ 12775123 ] Apache Kafka Workflow [ 13052999 ]
        Tony Stevenson made changes -
        Workflow Apache Kafka Workflow [ 13052999 ] no-reopen-closed, patch-avail [ 13055037 ]
        Transition                        Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> In Progress               16h 58m                 1                 Neha Narkhede   04/Apr/13 23:10
        In Progress -> Patch Available    5h 17m                  1                 Neha Narkhede   05/Apr/13 04:27
        Patch Available -> Resolved       14s                     1                 Neha Narkhede   05/Apr/13 04:27
        Resolved -> Closed                7s                      1                 Neha Narkhede   05/Apr/13 04:27

          People

          • Assignee:
            Neha Narkhede
            Reporter:
            Neha Narkhede
          • Votes:
            0
            Watchers:
            3
