Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 3.9.0
None
None
- Kafka Version: Upgraded sequentially from 3.6.0 to 3.9.0
- Clusters: Three clusters named A, B, and C
- Clusters A and B mirror topics to cluster C using MirrorMaker 2
- Number of Consumer Groups: Approximately 200
- Number of Topics: Approximately 2000
- Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
Description
After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker 2. The connector fails with a RetriableException stating “Timeout while loading consumer groups.” This issue persists despite several attempts to resolve it.
Error Message:
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.lang.Thread.run(Thread.java:840)
Steps to Reproduce:
1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
3. Start MirrorMaker 2.
4. Observe the logs for the MirrorCheckpointConnector.
What We Tried:
Checked ACLs and Authentication:
- Ensured that the mirror_maker user has the necessary permissions and can authenticate successfully.
- Verified that we could list consumer groups using kafka-consumer-groups.sh with the mirror_maker user.
Increased Timeouts:
- Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
- Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
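For illustration, timeout settings of this kind might be written in the MirrorMaker 2 properties file roughly as shown below. This is only a sketch: the A->C flow prefix reuses the cluster names from this report, the file name mm2.properties is an assumption, and every value other than 300000 is a placeholder rather than a value taken from this report. Whether the admin.-prefixed client settings are honored at the flow level is also an assumption.

```properties
# Hypothetical mm2.properties fragment (cluster aliases A and C as named in this report)
# Connector-level timeout for admin tasks such as loading consumer groups (5 minutes)
A->C.admin.timeout.ms = 300000
# Placeholder values; the report does not state the exact numbers used
A->C.admin.request.timeout.ms = 120000
A->C.admin.retry.backoff.ms = 1000
```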
Enabled Detailed Logging:
- Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain more insights.
- No additional information that could help resolve the issue was found.
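A minimal sketch of the logging change described above, assuming MirrorMaker 2 is started with connect-mirror-maker.sh and reads the Log4j 1.x-style connect-log4j.properties file shipped with Kafka 3.9.0. If a different logging configuration is in use, the syntax will differ.

```properties
# Assumed location: config/connect-log4j.properties
# Raise only the MirrorMaker 2 connector package to DEBUG
log4j.logger.org.apache.kafka.connect.mirror=DEBUG
```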
Temporary Workarounds:
- Set emit.checkpoints.enabled and sync.group.offsets.enabled to false to prevent the MirrorCheckpointConnector from running.
- This is not a viable long-term solution as we need to synchronize consumer group offsets.
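As a rough sketch, the workaround above could look like this in the MirrorMaker 2 properties file. The A->C and B->C prefixes are assumptions based on the cluster names used in this report; the actual aliases in our logs differ (for example analytics-dev->app-dev).

```properties
# Hypothetical mm2.properties fragment: turn off checkpoint emission
# and consumer group offset sync for both replication flows
A->C.emit.checkpoints.enabled = false
A->C.sync.group.offsets.enabled = false
B->C.emit.checkpoints.enabled = false
B->C.sync.group.offsets.enabled = false
```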
Resolution:
Rolled Back to Kafka 3.8.1:
- As a test, we downgraded our Kafka clusters back to version 3.8.1.
- After the downgrade, the error disappeared, and the MirrorCheckpointConnector functioned correctly.
- This suggests that the issue was introduced in version 3.9.0.
Analysis:
Possible Relation to KAFKA-17232:
- We found the JIRA issue KAFKA-17232, titled “MirrorCheckpointConnector does not generate task configs if initial consumer group load times out.”
- It appears that changes introduced in Kafka 3.9.0 related to this issue may have inadvertently caused our problem.
- However, our clusters are not particularly large, and the initial consumer group load should not exceed the timeouts.
Request:
Assistance in Resolving the Issue:
- Is there a known workaround or configuration change that can prevent this error in Kafka 3.9.0?
- Could the changes made in KAFKA-17232 have unintentionally caused this problem?
- Are there plans to address this issue in an upcoming release?
Guidance on Next Steps:
- Should we avoid upgrading to versions beyond 3.8.1 until this issue is resolved?
- Is it advisable to apply any patches or pull requests manually?
Thank you for your attention to this matter. Please let me know if I can provide any additional information to help resolve this issue.
Best regards,
Asker Kakhramanov
Issue Links
- is caused by KAFKA-18021 “Disabled MirrorCheckpointConnector throws RetriableException on task config generation” (Open)