Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-13407

Kafka controller out of service after ZK leader restart

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.8.0, 2.8.1
    • None
    • None
    • None
    • Ubuntu 20.04

    Description

      When the Zookeeper leader disappears, a new instance becomes the leader, the instances need to reconnect to Zookeeper, but the Kafka "Controller" gets lost in limbo state after re-establishing connection.

      See below for how I manage to reproduce this over and over.

      Prerequisites

      Have a Kafka cluster with 3 instances running version 2.8.1. Figure out which one is the Controller. I'm using Kafkacat 1.5.0 and get this info using the `-L` flag.

      Zookeeper runs with 3 instances on version 3.5.9. Figure out which one is leader by checking

       

      echo stat | nc -v localhost 2181
      

       

       

      Reproduce

      1. Stop the leader Zookeeper service.

      2. Watch the logs of the Kafka Controller and ensure that it reconnects and registers again.

       

      Oct 27 09:13:08 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:08,882] INFO Unable to read additional data from server sessionid 0x1f2a12870003, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
      Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] WARN SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/opt/kafka/config/kafka_server_jaas.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. (org.apache.zookeeper.ClientCnxn)
      Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] INFO Opening socket connection to server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
      Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,548] ERROR [ZooKeeperClient Kafka server] Auth failed. (kafka.zookeeper.ZooKeeperClient)
      Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,549] INFO Socket connection established, initiating session, client: /10.10.85.215:39338, server: zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181 (org.apache.zookeeper.ClientCnxn)
      Oct 27 09:13:10 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:10,569] INFO Session establishment complete on server zookeeper-kafka.service.consul.lab.aws.blue.example.net/10.10.84.12:2181, sessionid = 0x1f2a12870003, negotiated timeout = 18000 (org.apache.zookeeper.ClientCnxn)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,548] INFO [ZooKeeperClient Kafka server] Reinitializing due to auth failure. (kafka.zookeeper.ZooKeeperClient)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [PartitionStateMachine controllerId=1003] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,550] INFO [ReplicaStateMachine controllerId=1003] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,551] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,552] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutting down (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Stopped (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,554] INFO [RequestSendThread controllerId=1003] Shutdown completed (kafka.controller.RequestSendThread)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,556] INFO Processing notification(s) to /config/changes (kafka.common.ZkNodeChangeNotificationListener)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO [Controller id=1003] Resigned (kafka.controller.KafkaController)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,557] INFO Creating /brokers/ids/1003 (is it secure? false) (kafka.zk.KafkaZkClient)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,560] INFO Updated cache from existing FinalizedFeaturesAndEpoch(features=Features{}, epoch=1) to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). (kafka.server.FinalizedFeatureCache)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Stat of the created znode at /brokers/ids/1003 is: 128849019044,128849019044,1635323000996,1635323000996,1,0,0,34265559924739,389,0,128849019044
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]:  (kafka.zk.KafkaZkClient)
      Oct 27 09:13:11 ip-10-10-85-215 kafka[62961]: [2021-10-27 09:13:11,572] INFO Registered broker 1003 at path /brokers/ids/1003 with addresses: PLAINTEXT://kafka-tr-2.node.consul.lab.aws.blue.example.net:9093,SASL_SSL://kafka-tr-2.node.consul.lab.aws.blue.example.net:9092, czxid (broker epoch): 128849019044 (kafka.zk.KafkaZkClient)
      

       

       

      Now it is in this limbo state but to notice the problem create a topic for example.

       

      ./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --create --topic danols-ts-4 --partitions 10 --replication-factor 3
      Created topic danols-ts-4.
      

       

       

      Then when describing the topic it looks like this:

       

      ./kafka-topics.sh --zookeeper zookeeper-kafka.service.consul.lab.aws.blue.example.net:2181/kafka_tr --describe --topic danols-ts-4
      Topic: danols-ts-4	PartitionCount: 10	ReplicationFactor: 3	Configs: 
      	Topic: danols-ts-4	Partition: 0	Leader: none	Replicas: 1003,1001,1002	Isr: 
      	Topic: danols-ts-4	Partition: 1	Leader: none	Replicas: 1001,1002,1003	Isr: 
      	Topic: danols-ts-4	Partition: 2	Leader: none	Replicas: 1002,1003,1001	Isr: 
      	Topic: danols-ts-4	Partition: 3	Leader: none	Replicas: 1003,1002,1001	Isr: 
      	Topic: danols-ts-4	Partition: 4	Leader: none	Replicas: 1001,1003,1002	Isr: 
      	Topic: danols-ts-4	Partition: 5	Leader: none	Replicas: 1002,1001,1003	Isr: 
      	Topic: danols-ts-4	Partition: 6	Leader: none	Replicas: 1003,1001,1002	Isr: 
      	Topic: danols-ts-4	Partition: 7	Leader: none	Replicas: 1001,1002,1003	Isr: 
      	Topic: danols-ts-4	Partition: 8	Leader: none	Replicas: 1002,1003,1001	Isr: 
      	Topic: danols-ts-4	Partition: 9	Leader: none	Replicas: 1003,1002,1001	Isr: 
      
      

       

       

      If I describe using the --bootstrap-server flag instead it doesn't exist

      ./kafka-topics.sh --bootstrap-server localhost:9093 --describe --topic danols-ts-4
      Error while executing topic command : Topic 'danols-ts-4' does not exist as expected
      [2021-10-27 09:24:15,691] ERROR java.lang.IllegalArgumentException: Topic 'danols-ts-4' does not exist as expected
      	at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:542)
      	at kafka.admin.TopicCommand$AdminClientTopicService.describeTopic(TopicCommand.scala:317)
      	at kafka.admin.TopicCommand$.main(TopicCommand.scala:69)
      	at kafka.admin.TopicCommand.main(TopicCommand.scala)
       (kafka.admin.TopicCommand$)
      

      If I restart the Kafka service on a broker that is not "Controller" it doesn't come up again. It just spams the following in the log for all different partitions and it's not possible to connect to it.

      Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
      Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:48,984] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='example-message-consumer', numPartitions=10, replicationFactor=3, assignments=[], configs=[]) (kafka.server.ZkAdminManager)
      Oct 27 09:27:48 ip-10-226-69-122 kafka[67342]: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
      Oct 27 09:27:49 ip-10-226-69-122 kafka[67342]: [2021-10-27 09:27:49,021] INFO [Admin Manager on Broker 1001]: Error processing create topic request CreatableTopic(name='__consumer_offsets', numPartitions=50, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='compression.type', value='producer'), CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='segment.bytes', value='104857600')]) (kafka.server.ZkAdminManager)
      

      To get out of this state I simply restart the Kafka service on the "Controller". It will make another become "Controller" and all brokers becomes OK again and the newly created topic gets all expected ISRs.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Olsson Daniel
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: