Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
2.3.1, 2.5.0, 2.4.1
-
None
-
None
Description
While fixing flakiness for ConnectDistributedTest.test_bounce inĀ KAFKA-10295, I observed that the status of connectors on zombie/offline workers was inconsistent during Incremental Cooperative Rebalancing's scheduled.delay.max.interval.ms. This is the reproduction case observed there:
- A task is running on worker A, with another worker B in the same distributed cluster
- Observe that on worker B's REST API, the task is initially correctly "RUNNING"
- Worker A is hard-stopped, and goes offline
- Observe that on worker B's REST API, the task is still "RUNNING"
- The group rebalances without worker A in the group, and begins the delay
- Observe that on worker B's REST API, the task is still "RUNNING"
- Worker A recovers and joins the group, before the delay expires
- Observe that on worker B's REST API, the task is still "RUNNING"
- The rebalance delay expires, and the task is assigned and started
- Observe that on worker B's REST API, the task is now correctly "RUNNING"
- In the first state (4), after the worker goes offline, but before the other workers learn that they have gone offline, it is acceptable that the task is still reported as running. We can't expect that the other workers know that worker A has gone offline until the group membership protocol informs them.
- In the second state (6), when a rebalance occurs and the worker is first known to be unhealthy, the state of the task is ambiguous, since it may be down completely, or running on a zombie worker. I'm not sure how best to capture this state under the existing enum's options, but it's probably closest to "UNASSIGNED" since the leader doesn't think that any worker is currently running that task.
- In the third state (8), when the bounced worker returns, the task is reported RUNNING on a worker which does not have the task assigned. This is the most inaccurate state reported, since the cluster has reached consensus, and yet the REST API still reports the wrong state of the task.
State (6) could be assigned a new state "UNKNOWN", introduced by a KIP. However, this is a large investment in process and time for what amounts to an almost cosmetic change, and we could either leave this state as "RUNNING", or change it to "UNASSIGNED"
StateĀ (8) could be described by "UNASSIGNED", and would be a tangible improvement for test_bounce, which currently needs to intentionally ignore the REST API's result here because it is inaccurate.
Attachments
Issue Links
- is related to
-
KAFKA-10295 ConnectDistributedTest.test_bounce should wait for graceful stop
- Resolved