  1. Kafka
  2. KAFKA-6714

KafkaController marks all Brokers as "Shutting down", though only one broker has been shut down


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0.2
    • Fix Version/s: None
    • Component/s: controller, core
    • Labels: None
    • Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances with 5 nodes and a Zookeeper cluster on r4.2xlarge instances with 3 nodes. The cluster is distributed across 2 availability zones.

    Description

      In our Kafka cluster we experienced a situation in which the Kafka controller had all brokers marked as "Shutting down", although only one broker had actually been shut down.

      The last log entry about broker state before the entry that lists all brokers as shutting down reports that no brokers are shutting down.

      As a consequence of this state, the Kafka controller is unable to elect a leader for any partition.

      kafka.controller Log (Level TRACE):
      [2018-03-15 16:28:24,288] INFO [Controller 5]: Shutting down broker 5 (kafka.controller.KafkaController)
      [2018-03-15 16:28:24,288] DEBUG [Controller 5]: All shutting down brokers: 5 (kafka.controller.KafkaController)
      [2018-03-15 16:28:24,288] DEBUG [Controller 5]: Live brokers: 1,2,3,4 (kafka.controller.KafkaController)
      ...
      [2018-03-15 16:28:36,846] INFO [Controller 3]: Currently active brokers in the cluster: Set(1, 2, 3, 4) (kafka.controller.KafkaController)
      [2018-03-15 16:28:36,846] INFO [Controller 3]: Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)
      ...
      [2018-03-19 17:57:22,273] INFO [Controller 3]: Shutting down broker 1 (kafka.controller.KafkaController)
      [2018-03-19 17:57:22,273] DEBUG [Controller 3]: All shutting down brokers: 1,5,2,3,4 (kafka.controller.KafkaController)
      [2018-03-19 17:57:22,273] DEBUG [Controller 3]: Live brokers:  (kafka.controller.KafkaController)
      
      state.change.logger Log (Level TRACE):
      [2018-03-19 17:57:22,275] ERROR Controller 3 epoch 83 encountered error while electing leader for partition [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] due to: No other replicas in ISR 1,3,5 for [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] besides shutting down brokers 1,5,2,3,4. (state.change.logger) 
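
The following is a minimal sketch of why the election fails in this state. It is a simplified model, not the Kafka source: `BrokerState`, `live` and `electLeader` are hypothetical names, and the sketch assumes that the live set is derived by subtracting the shutting-down set from the registered brokers, and that a leader is only chosen from ISR members that are still live.

```scala
// Simplified model of the controller's broker bookkeeping; NOT the actual Kafka code.
object ShutdownStateSketch {

  // Hypothetical stand-in for the relevant controller state.
  final case class BrokerState(registered: Set[Int], shuttingDown: Set[Int]) {
    // Assumption: "live" brokers are the registered brokers that are not flagged as shutting down.
    def live: Set[Int] = registered -- shuttingDown
  }

  // Picks the first ISR member that is still live.
  // Returns Left with a message modelled on the logged error when no ISR member is eligible.
  def electLeader(isr: Seq[Int], state: BrokerState): Either[String, Int] = {
    val isrText = isr.mkString(",")
    val downText = state.shuttingDown.mkString(",")
    isr.find(state.live.contains).toRight(
      s"No other replicas in ISR $isrText besides shutting down brokers $downText"
    )
  }
}
```

With one broker flagged (for example `BrokerState(Set(1, 2, 3, 4, 5), Set(1))`), an ISR of 1,3,5 still yields a leader. With all five brokers flagged, the live set is empty and every election returns the error, which matches the behaviour in the logs above.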

      The question is: why does the Kafka controller assume that all brokers are shutting down?

      The only place in the Kafka code (0.11.0.2) where we found the set of shutting-down brokers being changed is line 1407 of the class kafka.controller.KafkaController, in the method doControlledShutdown:

      info("Shutting down broker " + id)
      
      if (!controllerContext.liveOrShuttingDownBrokerIds.contains(id))
        throw new BrokerNotAvailableException("Broker id %d does not exist.".format(id))
      
      controllerContext.shuttingDownBrokerIds.add(id)
      

      However, if this code had added every broker to the set, we would expect to see the log entry "Shutting down broker n" for each broker in the log file, and it is not there.
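
This check can be automated against a controller log. The sketch below is an illustration only: `ShutdownLogCheck` and `unexplained` are hypothetical names, and the regular expressions are derived solely from the log lines quoted in this report. It collects, per controller, the broker ids that appear in an "All shutting down brokers" entry without an earlier "Shutting down broker n" entry from the same controller.

```scala
// Diagnostic sketch; the patterns are assumptions based on the log lines quoted in this report.
object ShutdownLogCheck {
  private val Announced = """\[Controller (\d+)\]: Shutting down broker (\d+)""".r.unanchored
  private val Reported  = """\[Controller (\d+)\]: All shutting down brokers: ([\d,]*)""".r.unanchored

  // Maps controller id -> broker ids reported as shutting down
  // that the same controller never announced beforehand.
  def unexplained(lines: Iterable[String]): Map[Int, Set[Int]] = {
    var announced = Map.empty[Int, Set[Int]]
    var missing = Map.empty[Int, Set[Int]]
    lines.foreach {
      case Announced(c, b) =>
        val ctrl = c.toInt
        announced = announced.updated(ctrl, announced.getOrElse(ctrl, Set.empty[Int]) + b.toInt)
      case Reported(c, ids) =>
        val ctrl = c.toInt
        val reported = ids.split(",").filter(_.nonEmpty).map(_.toInt).toSet
        val gap = reported -- announced.getOrElse(ctrl, Set.empty[Int])
        if (gap.nonEmpty)
          missing = missing.updated(ctrl, missing.getOrElse(ctrl, Set.empty[Int]) ++ gap)
      case _ => () // ignore unrelated log lines
    }
    missing
  }
}
```

Applied to the excerpt above, this would flag brokers 2, 3, 4 and 5 for Controller 3. The sketch does not model brokers leaving the set again after a completed shutdown, so it is only a starting point for narrowing down where the extra ids come from.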

      This is a recurring problem, but we have not been able to reproduce it.

          People

            Assignee: Unassigned
            Reporter: Uwe Eisele (ueisele)
            Votes: 0
            Watchers: 3
