Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
None
-
None
Description
Noticed that this test failed on Jenkins with
org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while completing controller activation: Should not have ZK migrations enabled on a cluster running metadata.version 3.0-IV1 at app//org.apache.kafka.controller.ActivationRecordsGenerator.recordsForNonEmptyLog(ActivationRecordsGenerator.java:154) at app//org.apache.kafka.controller.ActivationRecordsGenerator.generate(ActivationRecordsGenerator.java:229) at app//org.apache.kafka.controller.QuorumController$CompleteActivationEvent.generateRecordsAndResult(QuorumController.java:1237) at app//org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:784) at app//org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127) at app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210) at app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181) at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829) Caused by: java.lang.RuntimeException: Should not have ZK migrations enabled on a cluster running metadata.version 3.0-IV1 ... 8 more
I think this exception is a red herring (it's expected from one of the negative test cases). When trying to reproduce this issue locally, I do occasionally see the following type of failure
[2023-10-19 13:42:10,091] INFO Elected new leader: LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1). (org.apache.kafka.metalog.LocalLogManager$SharedLogData:300) [2023-10-19 13:42:10,091] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1)), nextEndOffset=0) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276) [2023-10-19 13:42:10,091] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:536) [2023-10-19 13:42:10,091] DEBUG [LocalLogManager 0] initialized local log manager for node 0 (org.apache.kafka.metalog.LocalLogManager:685) [2023-10-19 13:42:10,091] DEBUG [QuorumController id=0] Creating in-memory snapshot -1 (org.apache.kafka.timeline.SnapshotRegistry:203) [2023-10-19 13:42:10,091] INFO [QuorumController id=0] Creating new QuorumController with clusterId 6xRUXZ_kQ1GfuaHK42AS9Q. ZK migration mode is enabled. (org.apache.kafka.controller.QuorumController:1912) [2023-10-19 13:42:10,092] INFO [LocalLogManager 0] Node 0: registered MetaLogListener 1082153924 (org.apache.kafka.metalog.LocalLogManager:703) [2023-10-19 13:42:10,092] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:536) [2023-10-19 13:42:10,092] DEBUG [LocalLogManager 0] Node 0: Executing handleLeaderChange LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1) (org.apache.kafka.metalog.LocalLogManager:578) [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Executing handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577) [2023-10-19 13:42:10,092] INFO [QuorumController id=0] In the new epoch 1, the leader is (none). (org.apache.kafka.controller.QuorumController:1179) [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Processed handleLeaderChange[1] in 50 us (org.apache.kafka.controller.QuorumController:510) [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Executing handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577) [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Becoming the active controller at epoch 1, next write offset 1. (org.apache.kafka.controller.QuorumController:1175) [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Processed handleLeaderChange[1] in 77 us (org.apache.kafka.controller.QuorumController:510) [2023-10-19 13:42:10,092] WARN [QuorumController id=0] Performing controller activation. The metadata log appears to be empty. Appending 1 bootstrap record(s) at metadata.version 3.4-IV0 from bootstrap source 'test'. Putting the controller into pre-migration mode. No metadata updates will be allowed until the ZK metadata has been migrated. (org.apache.kafka.controller.QuorumController:108) [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Replayed a FeatureLevelRecord setting metadata version to 3.4-IV0 (org.apache.kafka.controller.FeatureControlManager:400) [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Replayed a ZkMigrationStateRecord changing the migration state from NONE to PRE_MIGRATION. (org.apache.kafka.controller.FeatureControlManager:421) [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Got exception while running completeActivation[1](342832465). Invoking handleException. (org.apache.kafka.queue.KafkaEventQueue:132) org.apache.kafka.raft.errors.NotLeaderException: Append failed because the given epoch 0 is stale. Current leader epoch = 1 at org.apache.kafka.metalog.LocalLogManager$SharedLogData.tryAppend(LocalLogManager.java:253) at org.apache.kafka.metalog.LocalLogManager$SharedLogData.tryAppend(LocalLogManager.java:233) at org.apache.kafka.metalog.LocalLogManager.scheduleAtomicAppend(LocalLogManager.java:795) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:839) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:816) at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:918) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:815) at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181) at java.base/java.lang.Thread.run(Thread.java:833) [2023-10-19 13:42:10,092] INFO [QuorumController id=0] completeActivation[1]: event failed with NotLeaderException (treated as NotControllerException) at epoch 1 in 387 microseconds. Renouncing leadership and reverting to the last committed offset -1. (org.apache.kafka.controller.QuorumController:542) [2023-10-19 13:42:10,092] ERROR Encountered fatalFaultHandler fault: exception while renouncing leadership (org.apache.kafka.server.fault.MockFaultHandler:44) java.lang.IllegalArgumentException: Attempt to resign from epoch 1 which is larger than the current epoch 0 at org.apache.kafka.metalog.LocalLogManager.resign(LocalLogManager.java:808) at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1268) at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:545) at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:177) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:879) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:869) at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148) at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181) at java.base/java.lang.Thread.run(Thread.java:833) [2023-10-19 13:42:10,093] DEBUG [QuorumController id=0] Processed read migration state(176059074) in 4 us (org.apache.kafka.controller.QuorumController:510) [2023-10-19 13:42:10,093] INFO [QuorumController id=0] QuorumController#beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue:492) [2023-10-19 13:42:10,093] DEBUG [QuorumController id=0] KafkaEventQueue#close: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489) [2023-10-19 13:42:10,093] INFO [QuorumController id=0] closed event queue. (org.apache.kafka.queue.KafkaEventQueue:514) [2023-10-19 13:42:10,093] INFO [LocalLogManager 0] beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue:492) [2023-10-19 13:42:10,093] DEBUG [LocalLogManager 0] Node 0: closing. (org.apache.kafka.metalog.LocalLogManager:653) [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] beginShutdown: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489) [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] KafkaEventQueue#close: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489) [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] Node 0: beginning shutdown. (org.apache.kafka.metalog.LocalLogManager:637) [2023-10-19 13:42:10,094] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt.empty, epoch=2)), nextEndOffset=1) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276) [2023-10-19 13:42:10,094] INFO Elected new leader: LeaderAndEpoch(leaderId=OptionalInt[0], epoch=3). (org.apache.kafka.metalog.LocalLogManager$SharedLogData:300) [2023-10-19 13:42:10,094] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt[0], epoch=3)), nextEndOffset=2) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276) [2023-10-19 13:42:10,094] DEBUG [QuorumController id=0] MetaLogManager.Listener: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489) [2023-10-19 13:42:10,094] INFO [LocalLogManager 0] closed event queue. (org.apache.kafka.queue.KafkaEventQueue:514) org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while renouncing leadership: Attempt to resign from epoch 1 which is larger than the current epoch 0 at org.apache.kafka.metalog.LocalLogManager.resign(LocalLogManager.java:808) at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1268) at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:545) at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:177) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:879) at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:869) at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148) at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210) at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.lang.IllegalArgumentException: Attempt to resign from epoch 1 which is larger than the current epoch 0 ... 11 more