Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-14292

KRaft broker controlled shutdown can be delayed indefinitely

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 3.4.0, 3.3.2
    • None
    • None

    Description

      We noticed when rolling a kraft cluster that it took an unexpectedly long time for one of the brokers to shutdown. In the logs, we saw the following:

      Oct 11, 2022 @ 17:53:38.277	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283357 is not greater than the broker's shutdown offset 2283358.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      2Oct 11, 2022 @ 17:53:38.277	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283362.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      3Oct 11, 2022 @ 17:53:40.278	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283366.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      4Oct 11, 2022 @ 17:53:40.278	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283361 is not greater than the broker's shutdown offset 2283362.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      5Oct 11, 2022 @ 17:53:42.279	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283365 is not greater than the broker's shutdown offset 2283366.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      6Oct 11, 2022 @ 17:53:42.279	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283370.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      7Oct 11, 2022 @ 17:53:44.280	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283369 is not greater than the broker's shutdown offset 2283370.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
      8Oct 11, 2022 @ 17:53:44.281	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283374.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	 

      From what I can tell, it looks like the controller waits until all brokers have caught up to the controlledShutdownOffset of the broker that is shutting down before allowing it to proceed. Probably the intent is to make sure they have all the leader and ISR state.

      The problem is that the controlledShutdownOffset seems to be updated after every heartbeat that the controller receives: https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996. Unless all other brokers can catch up to that offset before the next heartbeat from the shutting down broker is received, then the broker remains in the shutting down state indefinitely.

      In this case, it took more than 40 minutes before the broker completed shutdown:

      1Oct 11, 2022 @ 18:36:36.105	[Controller 1] The request from broker 8 to shut down has been granted since the lowest active offset 2288510 is now greater than the broker's controlled shutdown offset 2288510.	org.apache.kafka.controller.BrokerHeartbeatManager	INFO	
      2Oct 11, 2022 @ 18:40:35.197	[Controller 1] The request from broker 8 to unfence has been granted because it has caught up with the offset of it's register broker record 2288906.	org.apache.kafka.controller.BrokerHeartbeatManager	INFO

      It seems like the bug here is that we should not keep updating controlledShutdownOffset if it has already been set.

      Attachments

        Issue Links

          Activity

            People

              alyssahuang Alyssa Huang
              hachikuji Jason Gustafson
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: