Kafka / KAFKA-18007

MirrorCheckpointConnector fails with “Timeout while loading consumer groups” after upgrading to Kafka 3.9.0

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.9.0
    • Fix Version/s: None
    • Component/s: mirrormaker
    • Labels: None

    Description

      After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker 2. The connector fails with a RetriableException stating “Timeout while loading consumer groups.” This issue persists despite several attempts to resolve it.
      Error Message:

      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.lang.Thread.run(Thread.java:840)

      Steps to Reproduce:
      1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
      2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
      3. Start MirrorMaker 2.
      4. Observe the logs for the MirrorCheckpointConnector.

      What We Tried:
      Checked ACLs and Authentication:

      • Ensured that the mirror_maker user has the necessary permissions and can authenticate successfully.
      • Verified that we could list consumer groups using kafka-consumer-groups.sh with the mirror_maker user.

      Increased Timeouts:

      • Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
      • Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
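
      For reference, the timeout changes above might look like the following in the MirrorMaker 2 properties file. This is only a sketch: the report gives a value for admin.timeout.ms alone, so the other two values are illustrative assumptions, and depending on the setup the keys may need a cluster-alias or flow prefix.

      ```properties
      # Hypothetical excerpt from the MirrorMaker 2 properties file.
      # Timeout for MirrorMaker admin tasks; 300000 ms (5 minutes) is the value reported above.
      admin.timeout.ms = 300000
      # Admin client settings passed through the "admin." prefix.
      # These two values are illustrative only; the report does not state them.
      admin.request.timeout.ms = 120000
      admin.retry.backoff.ms = 1000
      ```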

      Enabled Detailed Logging:

      • Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain more insights.
      • No additional information that could help resolve the issue was found.
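
      A logger entry along these lines would enable that DEBUG output. It assumes the log4j 1.x style properties file that the MirrorMaker process reads; the file name and location depend on the installation.

      ```properties
      # Assumed log4j 1.x style syntax; add to the log4j properties file used by connect-mirror-maker.sh.
      # Raises the log level for all MirrorMaker 2 connector classes.
      log4j.logger.org.apache.kafka.connect.mirror=DEBUG
      ```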

      Temporary Workarounds:

      • Set emit.checkpoints.enabled and sync.group.offsets.enabled to false to prevent the MirrorCheckpointConnector from running.
      • This is not a viable long-term solution as we need to synchronize consumer group offsets.
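
      As a sketch, the temporary workaround corresponds to the two settings below. Placing them at the top level of the properties file is an assumption; they can also be scoped to a single flow with a source->target prefix.

      ```properties
      # Temporary workaround only: stops checkpoint emission and consumer group offset sync.
      emit.checkpoints.enabled = false
      sync.group.offsets.enabled = false
      ```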

      Resolution:
      Rolled Back to Kafka 3.8.1:

      • As a test, we downgraded our Kafka clusters to version 3.8.1.
      • After the downgrade, the error disappeared, and the MirrorCheckpointConnector functioned correctly.
      • This suggests that the issue was introduced in version 3.9.0.

      Analysis:
      Possible Relation to KAFKA-17232:

      • We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does not generate task configs if initial consumer group load times out.”
      • It appears that the changes made for KAFKA-17232 in Kafka 3.9.0 may have inadvertently caused our problem.
      • However, our clusters are not particularly large, and the initial consumer group load should not exceed the timeouts.

      Request:
      Assistance in Resolving the Issue:

      • Is there a known workaround or configuration change that can prevent this error in Kafka 3.9.0?
      • Could the changes made in KAFKA-17232 have unintentionally caused this problem?
      • Are there plans to address this issue in an upcoming release?

      Guidance on Next Steps:

      • Should we avoid upgrading to versions beyond 3.8.1 until this issue is resolved?
      • Is it advisable to apply any patches or pull requests manually?

      Thank you for your attention to this matter. Please let me know if I can provide any additional information to help resolve this issue.

      Best regards,
      Asker Kakhramanov

            People

              Assignee: Unassigned
              Reporter: kakhramanov Asker