[KAFKA-10296] Connector task reported RUNNING after hard bounce of worker - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.3.1, 2.5.0, 2.4.1
Fix Version/s: None
Component/s: connect
Labels:
None

Description

While fixing flakiness for ConnectDistributedTest.test_bounce in ~~KAFKA-10295~~, I observed that the status of connectors on zombie/offline workers was inconsistent during Incremental Cooperative Rebalancing's scheduled.delay.max.interval.ms. This is the reproduction case observed there:

A task is running on worker A, with another worker B in the same distributed cluster
Observe that on worker B's REST API, the task is initially correctly "RUNNING"
Worker A is hard-stopped, and goes offline
Observe that on worker B's REST API, the task is still "RUNNING"
The group rebalances without worker A in the group, and begins the delay
Observe that on worker B's REST API, the task is still "RUNNING"
Worker A recovers and joins the group, before the delay expires
Observe that on worker B's REST API, the task is still "RUNNING"
The rebalance delay expires, and the task is assigned and started
Observe that on worker B's REST API, the task is now correctly "RUNNING"

In the first state (4), after the worker goes offline, but before the other workers learn that they have gone offline, it is acceptable that the task is still reported as running. We can't expect that the other workers know that worker A has gone offline until the group membership protocol informs them.
In the second state (6), when a rebalance occurs and the worker is first known to be unhealthy, the state of the task is ambiguous, since it may be down completely, or running on a zombie worker. I'm not sure how best to capture this state under the existing enum's options, but it's probably closest to "UNASSIGNED" since the leader doesn't think that any worker is currently running that task.
In the third state (8), when the bounced worker returns, the task is reported RUNNING on a worker which does not have the task assigned. This is the most inaccurate state reported, since the cluster has reached consensus, and yet the REST API still reports the wrong state of the task.

State (6) could be assigned a new state "UNKNOWN", introduced by a KIP. However, this is a large investment in process and time for what amounts to an almost cosmetic change, and we could either leave this state as "RUNNING", or change it to "UNASSIGNED"
State (8) could be described by "UNASSIGNED", and would be a tangible improvement for test_bounce, which currently needs to intentionally ignore the REST API's result here because it is inaccurate.

Attachments

Issue Links

is related to

KAFKA-10295 ConnectDistributedTest.test_bounce should wait for graceful stop

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Greg Harris

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 20/Jul/20 06:39

Updated:: 20/Jul/20 06:40