Kafka / KAFKA-7331

Kafka does not detect broker loss in the event of a network partition within the cluster

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: controller, network
    • Labels: None

    Description

      We ran into this issue on our production cluster and had to manually remove the broker and enable unclean leader election to get the cluster working again. Ideally, Kafka would handle network partitions like this without manual intervention.

      The issue is reproducible with the following cross-datacenter Kafka cluster setup:
      DC 1: Kafka brokers + ZK nodes
      DC 2: Kafka brokers + ZK nodes
      DC 3: Kafka brokers + ZK nodes

      Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it cannot reach any hosts (brokers or ZK nodes) in the other two datacenters. The cluster then enters a state where the partitions led by brokerA contain only brokerA in their ISR. Because brokerA is still reachable from the ZK nodes in DC 1, it still shows up when querying ZK. The controller therefore considers brokerA up and does not elect new leaders for the partitions it leads. Those partitions stay down until brokerA is back or is completely removed from the cluster (in which case unclean leader election can elect new leaders for them).
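The failure mode described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not Kafka or ZooKeeper code: the host names, the one-broker-and-one-ZK-node-per-datacenter layout, and every function name are assumptions made up for illustration.

```python
# Toy model of the scenario (hypothetical names, not Kafka/ZooKeeper APIs).
# Assumption: one broker and one ZK node per datacenter.
HOSTS = {
    "brokerA": "dc1", "brokerB": "dc2", "brokerC": "dc3",
    "zk1": "dc1", "zk2": "dc2", "zk3": "dc3",
}
ZK_NODES = ["zk1", "zk2", "zk3"]


def can_reach(a, b, isolated):
    """An isolated host can only talk to hosts in its own datacenter."""
    if a in isolated or b in isolated:
        return HOSTS[a] == HOSTS[b]
    return True


def registered_in_zk(broker, isolated):
    """The broker stays registered while it can reach any ZK node."""
    return any(can_reach(broker, zk, isolated) for zk in ZK_NODES)


def in_sync_replicas(leader, replicas, isolated):
    """Replicas that cannot reach the leader fall out of the ISR."""
    return [r for r in replicas if can_reach(r, leader, isolated)]


def controller_triggers_election(leader, isolated):
    """The controller only reacts when the leader disappears from ZK."""
    return not registered_in_zk(leader, isolated)


isolated = {"brokerA"}
replicas = ["brokerA", "brokerB", "brokerC"]
print(registered_in_zk("brokerA", isolated))              # still registered
print(in_sync_replicas("brokerA", replicas, isolated))    # ISR shrinks to the leader
print(controller_triggers_election("brokerA", isolated))  # no election happens
```

In this model brokerA stays registered through zk1, its ISR shrinks to itself, and the controller sees no reason to act, which matches the stuck state reported here.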

      A faster recovery path could be for a majority of hosts (ZK nodes?) to recognize that brokerA is unreachable and mark it as down, so that elections for the partitions it leads can be triggered. This would avoid waiting indefinitely for the broker to come back, or having to take action to remove it from the cluster.
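The proposed majority check could look roughly like the sketch below. It is not an existing Kafka or ZooKeeper mechanism: the function name, the shape of the reachability reports, and the choice of ZK nodes as observers are all assumptions, and the issue itself leaves open which hosts would vote.

```python
def unreachable_by_majority(reports):
    """Return True when a strict majority of observers cannot reach the broker.

    `reports` maps a hypothetical observer name to whether that observer
    can currently reach the broker.
    """
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable > len(reports) / 2


# brokerA is reachable only from the ZK node in its own datacenter.
partitioned = {"zk1": True, "zk2": False, "zk3": False}
healthy = {"zk1": True, "zk2": True, "zk3": True}

print(unreachable_by_majority(partitioned))  # majority cannot reach brokerA
print(unreachable_by_majority(healthy))      # nobody reports a problem
```

Requiring a strict majority means a single observer that has lost connectivity itself cannot mark a healthy broker as down.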

          People

            Assignee: Unassigned
            Reporter: Kevin Li (kevkli)
            Votes: 0
            Watchers: 2
