Kafka / KAFKA-12525

Inaccurate task status due to status record interleaving in fast rebalances in Connect

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 2.4.1, 2.5.1, 2.7.0, 2.6.1
    • Fix Version/s: 3.6.0
    • Component/s: connect
    • Labels: None

    Description

      When a task is stopped in Connect, it produces an UNASSIGNED status record in the Connect status topic.
      Similarly, when a task is started or restarted, it produces a RUNNING status record in the same topic.

      At the same time, rebalances are decoupled from task start and stop. These operations run in a separate executor, outside the main worker thread that performs the rebalance.

      Normally, any delayed and stale UNASSIGNED status records are fenced by the worker that sends them. The worker uses the StatusBackingStore#putSafe method (called only for UNASSIGNED or FAILED), which rejects a stale status message as long as the worker is aware of the newer status record that declares the task as RUNNING.
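
      The fencing rule described above can be sketched roughly as follows. This is a simplified model and not the actual Connect code: the class SafeWriteSketch, its nested types, the in-memory "topic" list, and the generation comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a "safe" status write; not the real Connect classes.
final class SafeWriteSketch {
    enum State { RUNNING, UNASSIGNED, FAILED }

    static final class Status {
        final String taskId;
        final State state;
        final String workerId;
        final int generation;

        Status(String taskId, State state, String workerId, int generation) {
            this.taskId = taskId;
            this.state = state;
            this.workerId = workerId;
            this.generation = generation;
        }
    }

    // The latest status per task that THIS worker has read back from the topic.
    private final Map<String, Status> lastRead = new HashMap<>();
    // Stand-in for the status topic (assumption: a plain append-only list).
    final List<Status> topic = new ArrayList<>();

    // Called when the worker consumes a record from the status topic.
    void onRead(Status status) {
        lastRead.put(status.taskId, status);
    }

    // Writes the record unless the worker already knows of a newer RUNNING
    // record for the same task that was written by a different worker.
    boolean putSafe(Status status) {
        Status known = lastRead.get(status.taskId);
        boolean stale = known != null
                && known.state == State.RUNNING
                && !known.workerId.equals(status.workerId)
                && known.generation > status.generation;
        if (stale) {
            return false; // fenced: the stale record never reaches the topic
        }
        topic.add(status);
        return true;
    }
}
```

      The check depends only on what the sending worker has already read, so it cannot fence a record when the newer RUNNING status has not reached that worker yet.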

      In fast consecutive rebalances, where a task is revoked from one worker and assigned to another, a race condition has been observed in a small time window: a RUNNING status record is produced in the new generation and is immediately followed by a delayed UNASSIGNED status record belonging to the same or a previous generation. This happens when the worker sending the UNASSIGNED record has not yet read the RUNNING status record that corresponds to the latest generation.
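
      To make the interleaving concrete, here is a toy replay of the record order in the status topic. The class StatusRaceSketch, the "<STATE>@<generation>" string encoding, and the generation numbers are hypothetical; the sketch also assumes that status readers keep only the most recent record per task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the interleaving; not actual Connect code.
final class StatusRaceSketch {
    // Returns the records for one task in topic order, encoded as "<STATE>@<generation>".
    static List<String> replay(boolean workerAHasReadRunning) {
        List<String> topic = new ArrayList<>();
        topic.add("RUNNING@1"); // worker A starts the task in generation 1
        topic.add("RUNNING@2"); // after the rebalance, worker B starts it in generation 2
        // Worker A's stop was queued on its executor and completes only now.
        // Its safe write is rejected only if A has already consumed RUNNING@2.
        if (!workerAHasReadRunning) {
            topic.add("UNASSIGNED@1");
        }
        return topic;
    }

    // Assumption: readers report the state of the most recent record per task.
    static String reportedState(List<String> topic) {
        String last = topic.get(topic.size() - 1);
        return last.substring(0, last.indexOf('@'));
    }
}
```

      With replay(false), the last record is UNASSIGNED@1, so the task is reported as UNASSIGNED even though worker B is running it. With replay(true), the stale record is fenced and the reported state stays RUNNING.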

      A couple of options are available to remediate this race condition.
      For example, a worker that has started a task can rewrite the RUNNING status message in the topic if it reads a stale UNASSIGNED message from a previous generation (one that should have been fenced).
      Another option is to ignore stale UNASSIGNED messages (messages from an earlier generation than the one in which the task had RUNNING status).
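
      As an illustration of the second option, a reader of the status topic could compare generations before applying an UNASSIGNED record. The sketch below is one possible shape under that assumption and is not the fix that was merged; GenerationAwareStatusView and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader-side view that drops stale UNASSIGNED records.
final class GenerationAwareStatusView {
    enum State { RUNNING, UNASSIGNED }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Integer> generations = new HashMap<>();

    // Applies a record read from the status topic.
    // Returns false if the record was ignored as stale.
    boolean apply(String taskId, State newState, int generation) {
        boolean stale = newState == State.UNASSIGNED
                && states.get(taskId) == State.RUNNING
                && generation < generations.get(taskId);
        if (stale) {
            return false; // UNASSIGNED from an earlier generation than the RUNNING record
        }
        states.put(taskId, newState);
        generations.put(taskId, generation);
        return true;
    }

    // Returns the current state for the task, or null if none has been applied.
    State get(String taskId) {
        return states.get(taskId);
    }
}
```

      In this sketch, applying RUNNING in generation 2 followed by UNASSIGNED in generation 1 leaves the task reported as RUNNING, while an UNASSIGNED record from generation 2 or later is still applied.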

      Note that when this race condition takes place, only the status representation is inaccurate; the actual execution of the tasks remains unaffected (e.g. the tasks run correctly even though they appear as UNASSIGNED).

            People

              Assignee: sagarrao Sagar Rao
              Reporter: kkonstantine Konstantine Karantasis
