[KAFKA-14292] KRaft broker controlled shutdown can be delayed indefinitely - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.4.0, 3.3.2
Component/s: None
Labels:
None

Description

We noticed when rolling a kraft cluster that it took an unexpectedly long time for one of the brokers to shutdown. In the logs, we saw the following:

Oct 11, 2022 @ 17:53:38.277	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283357 is not greater than the broker's shutdown offset 2283358.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
2Oct 11, 2022 @ 17:53:38.277	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283362.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
3Oct 11, 2022 @ 17:53:40.278	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283366.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
4Oct 11, 2022 @ 17:53:40.278	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283361 is not greater than the broker's shutdown offset 2283362.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
5Oct 11, 2022 @ 17:53:42.279	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283365 is not greater than the broker's shutdown offset 2283366.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
6Oct 11, 2022 @ 17:53:42.279	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283370.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
7Oct 11, 2022 @ 17:53:44.280	[Controller 1] The request from broker 8 to shut down can not yet be granted because the lowest active offset 2283369 is not greater than the broker's shutdown offset 2283370.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG	
8Oct 11, 2022 @ 17:53:44.281	[Controller 1] Updated the controlled shutdown offset for broker 8 to 2283374.	org.apache.kafka.controller.BrokerHeartbeatManager	DEBUG

From what I can tell, it looks like the controller waits until all brokers have caught up to the controlledShutdownOffset of the broker that is shutting down before allowing it to proceed. Probably the intent is to make sure they have all the leader and ISR state.

The problem is that the controlledShutdownOffset seems to be updated after every heartbeat that the controller receives: https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996. Unless all other brokers can catch up to that offset before the next heartbeat from the shutting down broker is received, then the broker remains in the shutting down state indefinitely.

In this case, it took more than 40 minutes before the broker completed shutdown:

1Oct 11, 2022 @ 18:36:36.105	[Controller 1] The request from broker 8 to shut down has been granted since the lowest active offset 2288510 is now greater than the broker's controlled shutdown offset 2288510.	org.apache.kafka.controller.BrokerHeartbeatManager	INFO	
2Oct 11, 2022 @ 18:40:35.197	[Controller 1] The request from broker 8 to unfence has been granted because it has caught up with the offset of it's register broker record 2288906.	org.apache.kafka.controller.BrokerHeartbeatManager	INFO

It seems like the bug here is that we should not keep updating controlledShutdownOffset if it has already been set.

Attachments

Issue Links

links to

GitHub Pull Request #12736

Activity

People

Assignee:: Alyssa Huang

Reporter:: Jason Gustafson

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 12/Oct/22 00:16

Updated:: 13/Oct/22 20:33

Resolved:: 13/Oct/22 20:33