Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 3.6.0, 3.7.0, 3.6.1
Description
During the ZK migration, after KRaft becomes the active controller we enter a state called hybrid mode. This means we have a mixture of ZK and KRaft brokers. The KRaft controller updates the ZK brokers using the deprecated controller RPCs (LeaderAndIsr, UpdateMetadata, etc.).
A race condition exists where the KRaft controller gets stuck in a retry loop while initializing itself after a failover, which prevents it from sending these RPCs to the ZK brokers.
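For context, hybrid mode is reached by enabling the migration on the ZK brokers while a KRaft controller quorum is available. A minimal, illustrative excerpt of the broker-side settings (values are examples only, not taken from any affected cluster, and this is not a complete migration configuration):

# Illustrative ZK broker settings for migration (hybrid) mode; hostnames and node ids are examples
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
controller.quorum.voters=9990@controller1:9093,9991@controller2:9093,9992@controller3:9093
controller.listener.names=CONTROLLER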
Impact
Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK brokers will not receive any metadata updates. The ZK brokers can still send requests to the controller (such as AlterPartitions), but the metadata updates that result from those requests will never reach them. From the ZK brokers' perspective, the controller essentially appears to be unavailable.
Detection and Mitigation
This bug can be seen by observing failed ZK writes from a recently elected controller.
The tell-tale error message is:
Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper.
with a stack trace like:
java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This indicates that another KRaft controller is making writes to ZooKeeper.
    at kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
    at kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
    at kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
    at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
    at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
    at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
    at kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
    at kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
    at kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
    at org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
    at org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
    at java.base/java.lang.Thread.run(Thread.java:1583)
    at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
To mitigate this problem, a new KRaft controller should be elected. This can be done by restarting the problematic active controller. To verify that the new controller does not encounter the race condition, look for a log message like:
[KRaftMigrationDriver id=9991] 9991 transitioning from SYNC_KRAFT_TO_ZK to KRAFT_CONTROLLER_TO_BROKER_COMM state
Details
Controller A loses leadership via a Raft event (e.g., from a timeout in the Raft layer). A KRaftLeaderEvent is added to the KRaftMigrationDriver event queue behind any pending MetadataChangeEvents.
Controller B is elected and a KRaftLeaderEvent is added to KRaftMigrationDriver's queue. Since this controller is inactive, it processes the event immediately. This event simply loads the migration state from ZK (/migration) to check if the migration has been completed. This information is used to determine the downstream transitions in the state machine. Controller B passes through WAIT_FOR_ACTIVE_CONTROLLER and transitions to BECOME_CONTROLLER since the migration is done. While handling the BecomeZkControllerEvent, the controller forcibly takes ZK controller leadership by writing its ID into /controller and its epoch into /controller_epoch.
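A rough sketch of that takeover sequence, written against the plain ZooKeeper client rather than the actual KafkaZkClient code (the znode paths are the real ones; the class name, controller id, epoch value, payloads, and error handling are illustrative):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only: a stripped-down version of Controller B's takeover steps.
public class ControllerTakeoverSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // 1. Load the migration state and cache the current /migration znode version.
        //    This is the version the new controller will expect in later check ops.
        Stat migrationStat = new Stat();
        zk.getData("/migration", false, migrationStat);
        int cachedMigrationVersion = migrationStat.getVersion();

        // Race window: between this read and the epoch bump below, the old controller's
        // in-flight writes (checked against the *old* epoch) can still succeed and
        // bump /migration past cachedMigrationVersion.

        // 2. Forcibly claim ZK controllership: register in /controller and bump
        //    /controller_epoch (simplified; the real client uses conditional
        //    multi-ops and handles an existing ephemeral /controller node).
        zk.create("/controller", "{\"brokerid\":9991}".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        zk.setData("/controller_epoch", "508".getBytes(StandardCharsets.UTF_8), -1);

        System.out.println("Cached /migration version: " + cachedMigrationVersion);
        zk.close();
    }
}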
The change to /controller_epoch causes all of the pending writes on Controller A to fail since those writes are doing a check op on /controller_epoch as part of the multi-op writes to ZK.
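The fencing works roughly like the following ZooKeeper multi-op sketch (simplified and illustrative; the real dual-writes are built in KafkaZkClient, batch many operations, and carry real metadata payloads; the topic path and version values here are hypothetical):

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: how the old controller's dual-writes are fenced by /controller_epoch.
public class FencedDualWriteSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // The /controller_epoch znode version Controller A saw when it became active.
        int epochZkVersionSeenByA = 41;

        try {
            // Each dual-write is a ZK multi-op that only commits if /controller_epoch
            // still has the expected version. Once Controller B bumps the epoch, the
            // check op fails and the whole transaction is rejected.
            zk.multi(List.of(
                    Op.check("/controller_epoch", epochZkVersionSeenByA),
                    Op.setData("/brokers/topics/foo",   // hypothetical topic znode
                            "{...}".getBytes(StandardCharsets.UTF_8), -1)));
        } catch (KeeperException fenced) {
            // Typically a bad-version failure here: A has been fenced out of ZK writes.
            System.out.println("Fenced by a newer controller epoch: " + fenced.code());
        }

        zk.close();
    }
}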
However, there is a race between Controller B loading the state from /migration and updating /controller_epoch. In that window it is possible for Controller A to successfully write to ZK with its older epoch. This causes the znode version of /migration to increase, which causes Controller B to get stuck.
It is safe for the old controller to be making these writes, since we only dual-write committed state from KRaft (i.e., "write-behind"), but this race leaves the new controller with a stale version of /migration.
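Put together, the failure mode on the new controller looks roughly like the sketch below (not the actual KRaftMigrationDriver/KafkaZkClient retry logic; the expected version echoes the example error message above, and the topic path is hypothetical):

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: why the new controller's SYNC_KRAFT_TO_ZK writes keep failing.
public class StaleMigrationVersionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // /migration version cached by Controller B *before* Controller A's late write.
        // A's write bumped the real version, so this expectation is permanently stale.
        int staleMigrationVersion = 507_823;

        while (true) {
            try {
                // Each write in the metadata sync checks the /migration version the
                // driver believes is current. With a stale expectation the check never
                // passes, so the sync is retried indefinitely and the driver never
                // reaches KRAFT_CONTROLLER_TO_BROKER_COMM, i.e. it never starts sending
                // LeaderAndIsr/UpdateMetadata to the ZK brokers.
                zk.multi(List.of(
                        Op.check("/migration", staleMigrationVersion),
                        Op.setData("/brokers/topics/foo",   // hypothetical topic znode
                                "{...}".getBytes(StandardCharsets.UTF_8), -1)));
                break; // unreachable in the buggy scenario
            } catch (KeeperException e) {
                System.out.println("Check op on /migration failed, retrying: " + e.code());
                Thread.sleep(100);
            }
        }

        zk.close();
    }
}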