Kafka / KAFKA-10105

Regression in group coordinator dealing with flaky clients joining while leaving

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: core
    • Labels:
      None
    • Environment:
      Kafka 2.4.1 on jre 11 on debian 9 in docker

      Description

      Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer correctly handles a consumer that sends a join after a leave.

      What happens now is that if a consumer sends a leave and then tries to send a join again while it is shutting down, the group coordinator adds the leaving member to the group but never seems to heartbeat that member.

      Because that consumer process is then gone, when it starts up and joins again it is added as a new member, but the zombie member is still in the group and is included in the partition assignment, which means its partitions never get consumed. It can also happen that one of the zombies becomes group leader, so the rebalance gets stuck forever and the group is entirely blocked.
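
      To illustrate the effect on partition assignment, here is a minimal toy model. It is not Kafka's assignor or coordinator code: the round-robin helper, the member ids, and the six-partition topic are all hypothetical, chosen only to show how a member with no live process behind it still receives partitions.

```python
from typing import Dict, List, Set


def assign_round_robin(members: List[str], partitions: List[int]) -> Dict[str, List[int]]:
    """Spread partitions over members in sorted order (illustrative only)."""
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


def unconsumed_partitions(assignment: Dict[str, List[int]], live: Set[str]) -> List[int]:
    """Partitions owned by members that have no running process behind them."""
    return sorted(p for m, ps in assignment.items() if m not in live for p in ps)


# Hypothetical member ids: "consumer-1-old" stands for the zombie left behind by
# the join sent during shutdown; "consumer-1-new" is the same client after restart.
group_members = ["consumer-1-old", "consumer-1-new", "consumer-2"]
live_members = {"consumer-1-new", "consumer-2"}

assignment = assign_round_robin(group_members, list(range(6)))
stalled = unconsumed_partitions(assignment, live_members)

print(assignment)
print("partitions with no live consumer:", stalled)
```

      In this sketch the zombie is handed a third of the partitions, and nothing ever reads them. The stuck-rebalance case is the same situation with the zombie in the leader role: no live process is left to compute and send back the assignment.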

      I have not been able to track down where this was introduced between 1.1.0 and 2.4.1, but I will look into this further. Unfortunately the logs are essentially silent about the zombie members, and I only had INFO-level logging on during the issue. We got back to a working state by stopping all the consumers in the group and restarting the broker coordinating that group.

              People

              • Assignee:
                Unassigned
              • Reporter:
                William Reynolds (william_reynolds)
              • Votes:
                0
              • Watchers:
                8
