Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-15648

QuorumControllerTest#testBootstrapZkMigrationRecord is flaky

    XMLWordPrintableJSON

Details

    Description

      Noticed that this test failed on Jenkins with

      org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while completing controller activation: Should not have ZK migrations enabled on a cluster running metadata.version 3.0-IV1
      	at app//org.apache.kafka.controller.ActivationRecordsGenerator.recordsForNonEmptyLog(ActivationRecordsGenerator.java:154)
      	at app//org.apache.kafka.controller.ActivationRecordsGenerator.generate(ActivationRecordsGenerator.java:229)
      	at app//org.apache.kafka.controller.QuorumController$CompleteActivationEvent.generateRecordsAndResult(QuorumController.java:1237)
      	at app//org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:784)
      	at app//org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
      	at app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
      	at app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
      	at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829)
      Caused by: java.lang.RuntimeException: Should not have ZK migrations enabled on a cluster running metadata.version 3.0-IV1
      	... 8 more
      

      I think this exception is a red herring (it's expected from one of the negative test cases). When trying to reproduce this issue locally, I do occasionally see the following type of failure

      [2023-10-19 13:42:10,091] INFO Elected new leader: LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1). (org.apache.kafka.metalog.LocalLogManager$SharedLogData:300)
      [2023-10-19 13:42:10,091] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1)), nextEndOffset=0) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276)
      [2023-10-19 13:42:10,091] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:536)
      [2023-10-19 13:42:10,091] DEBUG [LocalLogManager 0] initialized local log manager for node 0 (org.apache.kafka.metalog.LocalLogManager:685)
      [2023-10-19 13:42:10,091] DEBUG [QuorumController id=0] Creating in-memory snapshot -1 (org.apache.kafka.timeline.SnapshotRegistry:203)
      [2023-10-19 13:42:10,091] INFO [QuorumController id=0] Creating new QuorumController with clusterId 6xRUXZ_kQ1GfuaHK42AS9Q. ZK migration mode is enabled. (org.apache.kafka.controller.QuorumController:1912)
      [2023-10-19 13:42:10,092] INFO [LocalLogManager 0] Node 0: registered MetaLogListener 1082153924 (org.apache.kafka.metalog.LocalLogManager:703)
      [2023-10-19 13:42:10,092] DEBUG [LocalLogManager 0] Node 0: running log check. (org.apache.kafka.metalog.LocalLogManager:536)
      [2023-10-19 13:42:10,092] DEBUG [LocalLogManager 0] Node 0: Executing handleLeaderChange LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1) (org.apache.kafka.metalog.LocalLogManager:578)
      [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Executing handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577)
      [2023-10-19 13:42:10,092] INFO [QuorumController id=0] In the new epoch 1, the leader is (none). (org.apache.kafka.controller.QuorumController:1179)
      [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Processed handleLeaderChange[1] in 50 us (org.apache.kafka.controller.QuorumController:510)
      [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Executing handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577)
      [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Becoming the active controller at epoch 1, next write offset 1. (org.apache.kafka.controller.QuorumController:1175)
      [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Processed handleLeaderChange[1] in 77 us (org.apache.kafka.controller.QuorumController:510)
      [2023-10-19 13:42:10,092] WARN [QuorumController id=0] Performing controller activation. The metadata log appears to be empty. Appending 1 bootstrap record(s) at metadata.version 3.4-IV0 from bootstrap source 'test'. Putting the controller into pre-migration mode. No metadata updates will be allowed until the ZK metadata has been migrated. (org.apache.kafka.controller.QuorumController:108)
      [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Replayed a FeatureLevelRecord setting metadata version to 3.4-IV0 (org.apache.kafka.controller.FeatureControlManager:400)
      [2023-10-19 13:42:10,092] INFO [QuorumController id=0] Replayed a ZkMigrationStateRecord changing the migration state from NONE to PRE_MIGRATION. (org.apache.kafka.controller.FeatureControlManager:421)
      [2023-10-19 13:42:10,092] DEBUG [QuorumController id=0] Got exception while running completeActivation[1](342832465). Invoking handleException. (org.apache.kafka.queue.KafkaEventQueue:132)
      org.apache.kafka.raft.errors.NotLeaderException: Append failed because the given epoch 0 is stale. Current leader epoch = 1
      	at org.apache.kafka.metalog.LocalLogManager$SharedLogData.tryAppend(LocalLogManager.java:253)
      	at org.apache.kafka.metalog.LocalLogManager$SharedLogData.tryAppend(LocalLogManager.java:233)
      	at org.apache.kafka.metalog.LocalLogManager.scheduleAtomicAppend(LocalLogManager.java:795)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:839)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:816)
      	at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:918)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:815)
      	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      [2023-10-19 13:42:10,092] INFO [QuorumController id=0] completeActivation[1]: event failed with NotLeaderException (treated as NotControllerException) at epoch 1 in 387 microseconds. Renouncing leadership and reverting to the last committed offset -1. (org.apache.kafka.controller.QuorumController:542)
      [2023-10-19 13:42:10,092] ERROR Encountered fatalFaultHandler fault: exception while renouncing leadership (org.apache.kafka.server.fault.MockFaultHandler:44)
      java.lang.IllegalArgumentException: Attempt to resign from epoch 1 which is larger than the current epoch 0
      	at org.apache.kafka.metalog.LocalLogManager.resign(LocalLogManager.java:808)
      	at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1268)
      	at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:545)
      	at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:177)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:879)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:869)
      	at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148)
      	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      [2023-10-19 13:42:10,093] DEBUG [QuorumController id=0] Processed read migration state(176059074) in 4 us (org.apache.kafka.controller.QuorumController:510)
      [2023-10-19 13:42:10,093] INFO [QuorumController id=0] QuorumController#beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue:492)
      [2023-10-19 13:42:10,093] DEBUG [QuorumController id=0] KafkaEventQueue#close: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489)
      [2023-10-19 13:42:10,093] INFO [QuorumController id=0] closed event queue. (org.apache.kafka.queue.KafkaEventQueue:514)
      [2023-10-19 13:42:10,093] INFO [LocalLogManager 0] beginShutdown: shutting down event queue. (org.apache.kafka.queue.KafkaEventQueue:492)
      [2023-10-19 13:42:10,093] DEBUG [LocalLogManager 0] Node 0: closing. (org.apache.kafka.metalog.LocalLogManager:653)
      [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] beginShutdown: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489)
      [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] KafkaEventQueue#close: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489)
      [2023-10-19 13:42:10,094] DEBUG [LocalLogManager 0] Node 0: beginning shutdown. (org.apache.kafka.metalog.LocalLogManager:637)
      [2023-10-19 13:42:10,094] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt.empty, epoch=2)), nextEndOffset=1) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276)
      [2023-10-19 13:42:10,094] INFO Elected new leader: LeaderAndEpoch(leaderId=OptionalInt[0], epoch=3). (org.apache.kafka.metalog.LocalLogManager$SharedLogData:300)
      [2023-10-19 13:42:10,094] DEBUG append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt[0], epoch=3)), nextEndOffset=2) (org.apache.kafka.metalog.LocalLogManager$SharedLogData:276)
      [2023-10-19 13:42:10,094] DEBUG [QuorumController id=0] MetaLogManager.Listener: Event queue is already shutting down. (org.apache.kafka.queue.KafkaEventQueue:489)
      [2023-10-19 13:42:10,094] INFO [LocalLogManager 0] closed event queue. (org.apache.kafka.queue.KafkaEventQueue:514)
      
      org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: exception while renouncing leadership: Attempt to resign from epoch 1 which is larger than the current epoch 0
      
      	at org.apache.kafka.metalog.LocalLogManager.resign(LocalLogManager.java:808)
      	at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:1268)
      	at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:545)
      	at org.apache.kafka.controller.QuorumController.access$800(QuorumController.java:177)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.complete(QuorumController.java:879)
      	at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.handleException(QuorumController.java:869)
      	at org.apache.kafka.queue.KafkaEventQueue$EventContext.completeWithException(KafkaEventQueue.java:148)
      	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:137)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
      	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      Caused by: java.lang.IllegalArgumentException: Attempt to resign from epoch 1 which is larger than the current epoch 0
      	... 11 more
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            davidarthur David Arthur
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: