Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
Description
After a crash, the old leader was still recorded as leader in its on-disk quorum state:
$ cat ../__cluster_metadata-0/quorum-state
{"leaderId":2,"leaderEpoch":77,"votedId":-1,"votedDirectoryId":"AAAAAAAAAAAAAAAAAAAAAA","data_version":1}
Meanwhile, the rest of the quorum moved on from epoch 77:
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:9093,localhost:9094,localhost:9095 describe --status
[2024-08-16 14:03:14,897] WARN [AdminClient clientId=adminclient-1] Connection to node -2 (localhost/127.0.0.1:9094) could not be established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
ClusterId: kfalSizvRGOry-gExUTS5A
LeaderId: 1
LeaderEpoch: 78
HighWatermark: 98479
...
After restarting the failed controller, it looks like the state machine is notified that it is leader. This should not happen.
[2024-08-16 14:22:22,502] DEBUG [RaftManager id=2] Notifying listener org.apache.kafka.controller.QuorumController$QuorumMetaLogListener@22148336 of leader change LeaderAndEpoch(leaderId=OptionalInt[2], epoch=77) (org.apache.kafka.raft.KafkaRaftClient)
[2024-08-16 14:22:22,508] INFO [controller-2-ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-08-16 14:22:22,508] INFO [controller-2-ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2024-08-16 14:22:22,508] TRACE [RaftManager id=2] Received inbound message InboundResponse(correlationId=0, data=EndQuorumEpochResponseData(errorCode=0, topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, errorCode=74, leaderId=3, leaderEpoch=79)])], nodeEndpoints=[NodeEndpoint(nodeId=3, host='localhost', port=9095)]), source=localhost:9093 (id: 1 rack: null)) (org.apache.kafka.raft.KafkaRaftClient)
[2024-08-16 14:22:22,509] TRACE Writing tmp quorum state /tmp/kraft-controller-2-logs/__cluster_metadata-0/quorum-state.tmp (org.apache.kafka.raft.FileQuorumStateStore)
[2024-08-16 14:22:22,510] ERROR Encountered fatal fault: exception while renouncing leadership (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.IllegalStateException: Attempt to resign by a non-voter
    at org.apache.kafka.raft.KafkaRaftClient.resign(KafkaRaftClient.java:3359)
    at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1263)
    at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:544)
    at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:179)
    at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:874)
    at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:864)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:153)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:142)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:215)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
    at java.base/java.lang.Thread.run(Thread.java:840)
When a controller that was leader is restarted after a crash, the controller gets notified of leadership based on its persisted state. This is not correct. The controller should only get notified once it has reached the high-watermark.
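The expected behavior can be illustrated with a small, self-contained sketch. It is not the KafkaRaftClient code and not necessarily how the fix was implemented: the class LeaderNotificationGate, its methods, and the caughtUp flag are all hypothetical names invented for illustration. The idea is only that a "you are leader" notification loaded from disk is held back until the local replica has reached the high-watermark in that epoch, and is dropped if a newer leader is learned first.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (assumption: none of these names exist in Kafka).
 * Holds back local-leadership notifications until the high-watermark is reached.
 */
final class LeaderNotificationGate {
    /** Minimal stand-in for a (leaderId, epoch) pair. */
    static final class Change {
        final int leaderId;
        final int epoch;

        Change(int leaderId, int epoch) {
            this.leaderId = leaderId;
            this.epoch = epoch;
        }
    }

    private final int localId;
    private final List<Change> delivered = new ArrayList<>();
    private Change pending = null; // local leadership not yet safe to announce

    LeaderNotificationGate(int localId) {
        this.localId = localId;
    }

    /** Called whenever the leader view changes, including state loaded from disk. */
    void onLeaderChange(int leaderId, int epoch, boolean caughtUp) {
        Change change = new Change(leaderId, epoch);
        if (leaderId == localId && !caughtUp) {
            pending = change; // hold: the persisted epoch may be stale
            return;
        }
        pending = null; // a newer view supersedes any held notification
        delivered.add(change);
    }

    /** Called when the local replica reaches the high-watermark in the given epoch. */
    void onHighWatermarkReached(int epoch) {
        if (pending != null && pending.epoch == epoch) {
            delivered.add(pending);
            pending = null;
        }
    }

    /** Notifications that would have been passed to the listener, in order. */
    List<Change> delivered() {
        return new ArrayList<>(delivered);
    }
}
```

Replaying the scenario from this report against the sketch: node 2 starts with leaderId=2, epoch=77 from disk and is not caught up, so nothing is delivered; when it then learns of leaderId=3, epoch=79, only that change reaches the listener and the stale epoch 77 leadership is never announced.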