KAFKA-17146

ZK to KRAFT migration stuck in pre-migration mode


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.7.0, 3.7.1
    • Fix Version/s: None
    • Component/s: controller, kraft, migration
    • Environment: isolated virtual machines: 3 VMs with Kafka brokers + 3 VMs with ZooKeeper/KRaft controllers

    Description

      I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster. (EDIT: the same issue happens with version 3.7.0.)

      I'm using this configuration to enable SSL everywhere, with SCRAM authentication for the brokers and PLAIN authentication for the controllers:

      listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
      
      
      inter.broker.listener.name=EXTERNAL_SASL
      sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
      sasl.mechanism=SCRAM-SHA-512
      sasl.mechanism.controller.protocol=PLAIN
      sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 
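      (The listener definitions themselves are omitted above; roughly, on the brokers they look like the sketch below, with EXTERNAL_SASL on port 9095 as seen in the broker registration log further down. Hostnames are placeholders.)

      listeners=EXTERNAL_SASL://0.0.0.0:9095
      advertised.listeners=EXTERNAL_SASL://vmk-tdtkafka-01:9095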

      The cluster initially has 3 brokers and 3 ZooKeeper nodes; then a quorum of 3 KRaft controllers is configured and runs in parallel, as per the documentation for the migration process.
      I've started the migration with the 3 controllers enrolled with SASL_SSL and PLAIN authentication (the controller configuration is sketched after the log line below), and I already see a strange TRACE log:

      TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 
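      For reference, the KRaft controller configuration is roughly the following sketch, on top of the shared SSL/SASL settings shown above (node IDs 1000 and 3000 appear in the logs; the 2000 voter, hostnames and ports are placeholders):

      process.roles=controller
      node.id=3000
      controller.listener.names=CONTROLLER
      listeners=CONTROLLER://vmk-tdtkraft-03:9093
      controller.quorum.voters=1000@vmk-tdtkraft-01:9093,2000@vmk-tdtkraft-02:9093,3000@vmk-tdtkraft-03:9093
      # migration settings (KIP-866)
      zookeeper.metadata.migration.enable=true
      zookeeper.connect=vmk-tdtzk-01:2181,vmk-tdtzk-02:2181,vmk-tdtzk-03:2181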

      Later this message appears, where KRaft is waiting for the brokers to connect:

      INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 

      As soon as I start to reconfigure the brokers so that they connect to the new controllers (the settings added to each broker are sketched after the logs below), everything looks good on the KRaft controllers, with notifications that the ZK brokers are connecting, registering and being enrolled correctly:

      INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
      [...]
      INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
      [...]
      INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 
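      Concretely, the broker reconfiguration consists of adding the migration settings to each ZK broker before its rolling restart, roughly as follows (the voter list matches the controller sketch above; hostnames and ports are placeholders):

      zookeeper.metadata.migration.enable=true
      controller.listener.names=CONTROLLER
      controller.quorum.voters=1000@vmk-tdtkraft-01:9093,2000@vmk-tdtkraft-02:9093,3000@vmk-tdtkraft-03:9093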

      As soon as the first broker is connected, we start to get these INFO logs related to the migration process on the controller:

      INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
      INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController)

      but also, on the most recently restarted broker, requests to auto-create topics that already exist, looping every 30 seconds:

      INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
      INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
      INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager) 

      Up to the moment where a controller still exists in the old cluster (i.e. on one of the Kafka brokers), everything runs fine. As soon as the last node is restarted, things go off the rails. This last node never gets any partitions assigned and the cluster stays forever with under-replicated partitions. Below are the logs from the last broker's registration, which should kick off the migration, but the cluster stays forever in the SYNC_KRAFT_TO_ZK state in pre-migration mode.

      INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
      [...]
      INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 

      The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper and restore the ZooKeeper-only configuration on the brokers. A cleanup of the KRaft controllers is necessary too.
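      For completeness, the recovery steps are roughly the following sketch (the ZooKeeper address and the controller metadata directory are placeholders):

      # stop the KRaft controllers and roll the brokers back to the ZooKeeper-only configuration, then:
      bin/zookeeper-shell.sh vmk-tdtzk-01:2181 delete /controller
      # finally wipe the KRaft controllers' metadata log directory before any new migration attempt:
      rm -rf /path/to/kraft-controller-metadata/*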

      The migration never starts and the controllers never understand that they have to migrate the data from ZooKeeper. On top of that, the new controller claims to be the active controller, yet rejects controller write operations with NOT_CONTROLLER.


    People

      Assignee: Unassigned
      Reporter: Simone Brundu (saimon46)
