Description
Hey,
Tested using 3.9.0 RC0. The issue only affects KRaft.
It seems that "kafka-metadata-quorum.sh remove-controller" causes the removed controller to crash if it is one of the controllers specified using "--initial-controllers".
Steps to reproduce:
Clean up and set up the environment
rm -rf /tmp/controllers && \
mkdir -p /tmp/controllers/c1 && \
mkdir -p /tmp/controllers/c2 && \
mkdir -p /tmp/controllers/c3
export KAFKA_HOME=<your_kafka_3_9_home>
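The steps below refer to c1.properties, c2.properties and c3.properties, which are not included in this report. As a rough sketch only, controller-only configs matching the node IDs, ports and directories used here could be generated like this. Every value is an assumption inferred from the commands in this report, not a copy of the files that were actually used:

```shell
# Assumed controller-only KRaft configs for nodes 1001..1003.
# Written to the current directory, since the commands in this report
# pass the config files by relative path.
i=1
for id in 1001 1002 1003; do
  port=$((10000 + i))
  cat > "c${i}.properties" <<EOF
process.roles=controller
node.id=${id}
listeners=CONTROLLER://localhost:${port}
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT
controller.quorum.bootstrap.servers=localhost:10001,localhost:10002,localhost:10003
log.dirs=/tmp/controllers/c${i}
EOF
  i=$((i + 1))
done
```

Note that these use controller.quorum.bootstrap.servers rather than a static controller.quorum.voters list, since the voter set is supplied via --initial-controllers at format time.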
Format the controllers
$KAFKA_HOME/bin/kafka-storage.sh format --cluster-id 00000000-0000-0000-0000-000000000001 --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA --config c1.properties
$KAFKA_HOME/bin/kafka-storage.sh format --cluster-id 00000000-0000-0000-0000-000000000001 --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA --config c2.properties
$KAFKA_HOME/bin/kafka-storage.sh format --cluster-id 00000000-0000-0000-0000-000000000001 --initial-controllers 1001@localhost:10001:AAAAAAAAAAEAAAAAAAAAAA,1002@localhost:10002:AAAAAAAAAAEAAAAAAAAAAA,1003@localhost:10003:AAAAAAAAAAEAAAAAAAAAAA --config c3.properties
Start the controllers, in separate terminals
$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c1.properties
$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c2.properties
$KAFKA_HOME/bin/kafka-run-class.sh -name kafkaService kafka.Kafka c3.properties
Remove a controller:
$KAFKA_HOME/bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:10001,localhost:10002,localhost:10003,localhost:10004 remove-controller --controller-id 1001 --controller-directory-id AAAAAAAAAAEAAAAAAAAAAA
The process crashes with the following error:
[2024-10-09 15:19:15,574] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.RuntimeException: Unable to reset to last stable offset 55. No in-memory snapshot found for this offset.
at org.apache.kafka.controller.OffsetControlManager.deactivate(OffsetControlManager.java:268)
at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1281)
at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:552)
at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:180)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:885)
at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:875)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:153)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:142)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:215)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
at java.base/java.lang.Thread.run(Thread.java:840)
If the process that died is restarted, it joins the cluster and becomes an observer, as expected.
The crash doesn't happen in a slightly different case. I don't have the exact steps, but the idea is this:
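One way to confirm the voter and observer roles after the removal and restart is the describe subcommand of the same tool. A minimal sketch, assuming KAFKA_HOME is exported as above; the function names are made up for illustration:

```shell
# Thin wrappers around "kafka-metadata-quorum.sh describe".
# $1 is the controller endpoint to query, e.g. localhost:10002.

quorum_status() {
  # Summary of the quorum: leader, voters, observers.
  "$KAFKA_HOME/bin/kafka-metadata-quorum.sh" --bootstrap-controller "$1" describe --status
}

quorum_replication() {
  # Per-node view, including each node's role in the quorum.
  "$KAFKA_HOME/bin/kafka-metadata-quorum.sh" --bootstrap-controller "$1" describe --replication
}
```

For example, running `quorum_status localhost:10002` against one of the surviving controllers should show whether node 1001 is still listed among the voters or only among the observers.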
1. Create a 3-controller cluster as above
2. Format and start a 4th controller.
3. Add the 4th controller as a voter.
4. Remove the 4th controller to make it an observer. It becomes an observer, as expected.
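Since the exact commands for this case were not recorded, the following is only a hedged sketch of how steps 2 to 4 might look. Node ID 1004, the file c4.properties, the directory /tmp/controllers/c4 and all function names are assumptions made for illustration. The 4th controller still has to be started between formatting and adding it, the same way as the other three.

```shell
# Hypothetical 4th controller: node.id 1004, config c4.properties,
# log dir /tmp/controllers/c4. None of these names come from the report.
C4_META="${C4_META:-/tmp/controllers/c4/meta.properties}"

format_c4() {
  # Format without a voter set, so the node starts out as an observer.
  "$KAFKA_HOME/bin/kafka-storage.sh" format \
    --cluster-id 00000000-0000-0000-0000-000000000001 \
    --no-initial-controllers --config c4.properties
}

read_directory_id() {
  # Print the directory.id entry from a meta.properties file ($1).
  sed -n 's/^directory\.id=//p' "$1"
}

add_c4() {
  # add-controller reads the node's identity from its own config file.
  "$KAFKA_HOME/bin/kafka-metadata-quorum.sh" --command-config c4.properties \
    --bootstrap-controller localhost:10001 add-controller
}

remove_c4() {
  # The directory ID is generated at format time, so look it up first.
  "$KAFKA_HOME/bin/kafka-metadata-quorum.sh" --bootstrap-controller localhost:10001 \
    remove-controller --controller-id 1004 \
    --controller-directory-id "$(read_directory_id "$C4_META")"
}
```

Unlike the initial controllers, whose directory IDs are fixed on the command line, the 4th controller gets a generated directory ID, which is why the sketch reads it back from meta.properties before calling remove-controller.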
Because this case works, I'm guessing the crash is somehow related to the controller being one of the initial controllers.
I didn't dig deeper into why the crash occurs.