Kafka / KAFKA-10296

Connector task reported RUNNING after hard bounce of worker

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1, 2.5.0, 2.4.1
    • Fix Version/s: None
    • Component/s: connect
    • Labels: None

    Description

      While fixing flakiness in ConnectDistributedTest.test_bounce in KAFKA-10295, I observed that the status of tasks on zombie/offline workers was reported inconsistently during Incremental Cooperative Rebalancing's scheduled.rebalance.max.delay.ms delay. This is the reproduction case observed there:

      1. A task is running on worker A, with another worker B in the same distributed cluster
      2. Observe that on worker B's REST API, the task is initially correctly "RUNNING"
      3. Worker A is hard-stopped, and goes offline
      4. Observe that on worker B's REST API, the task is still "RUNNING"
      5. The group rebalances without worker A in the group, and begins the delay
      6. Observe that on worker B's REST API, the task is still "RUNNING"
      7. Worker A recovers and joins the group, before the delay expires
      8. Observe that on worker B's REST API, the task is still "RUNNING"
      9. The rebalance delay expires, and the task is assigned and started
      10. Observe that on worker B's REST API, the task is now correctly "RUNNING"
      • In the first state (4), after worker A goes offline but before the other workers learn that it has gone offline, it is acceptable that the task is still reported as running. We can't expect the other workers to know that worker A has gone offline until the group membership protocol informs them.
      • In the second state (6), when a rebalance occurs and the worker is first known to be unhealthy, the state of the task is ambiguous, since it may be down completely, or running on a zombie worker. I'm not sure how best to capture this state under the existing enum's options, but it's probably closest to "UNASSIGNED" since the leader doesn't think that any worker is currently running that task.
      • In the third state (8), when the bounced worker returns, the task is reported RUNNING on a worker which does not have the task assigned. This is the most inaccurate state reported, since the cluster has reached consensus, and yet the REST API still reports the wrong state of the task.
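
The observations in steps 2, 4, 6, 8 and 10 can be reproduced by polling worker B's status endpoint (GET /connectors/{name}/status) and reading the per-task state. The following is a minimal sketch, not the code used by test_bounce; the helper names fetch_status and task_states are hypothetical, as are the host and connector name in the usage note.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url, connector):
    """GET /connectors/{connector}/status from a single worker's REST API."""
    with urlopen(f"{base_url}/connectors/{connector}/status") as resp:
        return json.load(resp)


def task_states(status):
    """Map task id -> (state, worker_id) from a status response body."""
    return {t["id"]: (t["state"], t["worker_id"]) for t in status["tasks"]}
```

For example, task_states(fetch_status("http://worker-b:8083", "my-connector")) could be called after each step to record which worker the task is attributed to and in which state.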

      State (6) could be given a new state "UNKNOWN", introduced by a KIP. However, this is a large investment in process and time for what amounts to an almost cosmetic change, so we could instead either leave this state as "RUNNING" or change it to "UNASSIGNED".
      State (8) could be described by "UNASSIGNED". This would be a tangible improvement for test_bounce, which currently needs to intentionally ignore the REST API's result here because it is inaccurate.
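
One way to picture the "UNASSIGNED" option is as a rule that trusts a stored task status only when it was written by the worker that currently holds the assignment. This is an illustrative sketch of that idea only, not how Connect computes status today; the function and parameter names are hypothetical.

```python
def reported_state(stored_state, stored_worker, assigned_worker):
    """Hypothetical rule for the state a worker's REST API would report.

    stored_state / stored_worker: the last status written for the task and
    the worker that wrote it.
    assigned_worker: the worker that holds the task in the current
    assignment, or None if the task is unassigned (for example while the
    rebalance delay is pending).
    """
    if assigned_worker is None or stored_worker != assigned_worker:
        return "UNASSIGNED"
    return stored_state
```

Under this rule, state (4) still reports "RUNNING" because the assignment has not changed yet, while states (6) and (8) report "UNASSIGNED" because no worker holds the task during the delay.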

            People

              Assignee: Unassigned
              Reporter: Greg Harris (gharris1727)