KAFKA-12798

Fixing MM2 rebalance timeout issue when source cluster is not available


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Labels: mirrormaker, replication

    Description

      If the network configuration of a source cluster taking part in a replication flow changes (for instance, the port number changes when TLS is enabled or disabled), MirrorMaker2 will not update its internal configuration, even after a reconfiguration followed by a restart.

      What happens in MirrorMaker2 after a cluster's "identity" (i.e. its connectivity config) changes:

      1. MM2 driver (MirrorMaker class) starts up with the new config.
      2. DistributedHerder joins a dedicated consumer group that decides which driver instance has control over the assignments and the configuration topic.
      3. The driver caches the consumer group assignment, which indicates that it is the leader of the group.
      4. The driver reads the configuration topic (which still contains the old config) and starts the MM connectors.
      5. Since the old config is invalid, the connectors can no longer connect to the cluster - MirrorSourceConnector tries to query the cluster through the admin client, but the queries time out after 2 minutes in total (there are 2 startup tasks affecting the source cluster, with a 1-minute timeout for each).
        1. In the meantime, the background heartbeat thread checks on the state of the herder's consumer group membership. The default rebalance timeout is 1 minute. Since the herder thread was blocked on the connector query timeouts, it was unable to call poll on the consumer, so the heartbeat thread invalidates the consumer membership and triggers the creation of a new consumer.
      6. The herder thread finishes the connector startup, and after realizing that the configuration has changed, tries to update the config topic.
        1. The config topic can only be updated by the leader herder.
        2. The driver checks the group assignment to see if it is the leader.
        3. The local cache still holds the old assignment, in which the leader is the previous consumer with its old ID.
        4. The current consumer ID of the driver does not match the cached leader ID.
      7. The driver refuses to update the config topic.
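      The stale leader check in steps 6-7 can be sketched as follows. This is a minimal illustration, not actual MirrorMaker2 code; the names (`StaleLeaderCheck`, `CachedAssignment`, the consumer IDs) are hypothetical:

```java
// Minimal sketch of the stale leader check described above (hypothetical names).
public class StaleLeaderCheck {
    // The assignment cached in step 3, recorded under the original consumer ID.
    static final class CachedAssignment {
        final String leaderId;
        CachedAssignment(String leaderId) { this.leaderId = leaderId; }
    }

    static boolean mayWriteConfigTopic(CachedAssignment cached, String currentMemberId) {
        // Only the group leader may write the config topic; the check compares
        // against the cached assignment, not the live group state.
        return cached.leaderId.equals(currentMemberId);
    }

    public static void main(String[] args) {
        // Step 3: the driver joins as "consumer-1" and is elected leader.
        CachedAssignment cached = new CachedAssignment("consumer-1");
        // Step 5.1: the heartbeat thread invalidates membership; rejoining yields a new ID.
        String currentMemberId = "consumer-2";
        // Steps 6.4 and 7: the IDs no longer match, so the config update is refused.
        System.out.println(mayWriteConfigTopic(cached, currentMemberId)); // false
    }
}
```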

      durban, thanks for digging deeper into this issue

      The proposed fix:
      The rebalance issue can be fixed by decreasing the time MM2 waits at startup for tasks that affect the source cluster. By lowering this timeout from 1 minute to a configurable value (15 seconds by default), the tasks affecting the source cluster no longer block for too long when the Kafka config is stale, so the herder thread can get back to the consumer and update the config topic.
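      The idea behind the fix can be sketched in plain Java: bound the blocking cluster query with a short timeout so the calling thread (here standing in for the herder thread) can return to poll() before the rebalance timeout expires. This is an illustration under assumed names, not the actual MM2 change; the real fix bounds the admin-client queries inside MirrorSourceConnector's startup scheduling:

```java
import java.util.concurrent.*;

// Sketch: bound a startup-time cluster query with a short, configurable timeout.
public class BoundedClusterQuery {
    // Simulates an admin query hanging against an unreachable source cluster.
    static String describeClusterBlocking() throws InterruptedException {
        Thread.sleep(60_000);
        return "cluster-metadata";
    }

    // Wait at most timeoutMs; on timeout give up so the caller can get back
    // to poll() before the rebalance timeout expires.
    static String queryWithTimeout(long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> f = pool.submit(BoundedClusterQuery::describeClusterBlocking);
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return null; // treat as "source cluster unavailable" and move on
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // With the old 1-minute wait this would block the caller; with a short
        // bound (50 ms here for the demo, 15 s in the fix) it returns quickly.
        System.out.println(queryWithTimeout(50)); // null
    }
}
```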

      The number of threads in the scheduler also had to be increased so that other scheduled tasks are not blocked behind the timed-out ones.
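      The scheduler change can be illustrated with a small sketch (hypothetical names, not the MM2 scheduler itself): with a single-threaded scheduler, one task stuck on an unreachable cluster blocks every task queued behind it, while two or more threads let the others proceed:

```java
import java.util.concurrent.*;

// Sketch: why the scheduler needs more than one thread.
public class SchedulerThreads {
    static boolean otherTasksRun(int threads) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(threads);
        CountDownLatch other = new CountDownLatch(1);
        // A "stuck" task, e.g. a topic refresh against the dead source cluster.
        scheduler.submit(() -> {
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) { }
        });
        // Another scheduled task that should still run promptly.
        scheduler.submit(other::countDown);
        boolean ran = other.await(1, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        return ran;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(otherTasksRun(2)); // true: second task is not blocked
        System.out.println(otherTasksRun(1)); // false: it queues behind the stuck task
    }
}
```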

      Testing done:

      1.  Configured replication between source->target.
      2.  Checked that replication was working.
      3.  Changed the broker port of the source Kafka cluster.
      4.  Restarted Kafka/MirrorMaker2 and produced new messages into the replicated topic.
      5.  After the restart, MM2 kept trying to use the old Kafka config and could not replicate even after a long time. After applying the fix, replication worked.

      The same scenario was also tested, but instead of changing the port, SSL was enabled on the source Kafka cluster.


      People

        Assignee: Unassigned
        Reporter: Barnabas Maidics (b.maidics)
        Votes: 1
        Watchers: 3
