Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-15808

kstreams app consumer threads all fenced due to outdated static membership ids during group coordinator migration

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.5.0
    • None
    • group-coordinator, streams
    • None

    Description

      I am wondering if my team encountered a bug with static membership in a KStream application (river.stream-filter in attached logs). There was an abrupt shutdown of all of it's consumer threads after they all became fenced on membership id conflicts. It occurred during a rolling deployment to the Kafka broker fleet which triggered a migration of the group coordinator to a new broker (from kafka-broker-55002 to kafka-broker-50003).

      My understanding is that a new group coordinator shouldn't start interacting with consumer threads until it has finished loading membership metadata during the migration. In this case, the new group coordinator started interacting with the thread consumers before it finished loading the latest membership information in the most recent generation of the membership metadata. As the migration was incomplete, the new group coordinator had outdated consumers membership IDs, which caused all the streaming app consumer threads to get fenced due to conflicting membership ids.

      Major timeline in the attached logs:

      • 07:27:04 - 07:32:25 UTC: Deployment to broker kafka-broker-55002.
      • 07:32:07 - 07:32:10 UTC: 36 unavailable broker messages seen for kafka-broker-55002 and 3 for kafka-broker-50003 (overlapping in time).
      • 07:32:21,505 UTC: Controller kafka-broker-51005 adds broker kafka-broker-55002 back as a live broker.
      • 07:32:23,276 UTC: kafka-broker-50003 last mention of loading membership data.
      • 07:32:34,175 - 07:32:34,278 UTC - kstream app threads (consumers in group) using latest member IDs interact with group coordinator kafka-broker-50003 which had outdated member IDs. Consumers threads from kstream app (river.stream-filter) are fenced and ejected from the consumer group.

      When kafka-broker-50003 was assigned group coordinator it was loading the latest (737) generation membership metadata, but appears to have never finished. Review of a previous successful migration shows multiple log lines for a new generation, but only one was seen in this deployment. Below are some log lines that I found from a kstream host kafka.log that highlights the member id difference in for generation 737:

      First generation and last seen log line on while the app was talking to group coordinator kafka-broker-50003:

      {{{{[2023-08-04 07:32:16,564] INFO Loaded member MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79, groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer, clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(stream)) in group river.stream-filter.v1 with generation 686. (kafka.coordinator.group.GroupMetadata$) }}}}

      {}{{{}}}{}{{...{}}}

      {{[2023-08-04 07:32:23,276] INFO Loaded member MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79, groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer, clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(stream)) in group river.stream-filter.v1 with generation 737. (kafka.coordinator.group.GroupMetadata$)}}

      The memberId was the same in both generations (ending with 615b538a0e79)

      The consumers got fenced after the controller sent a message around the time of the above log line.

      Later kafka-broker-55002 gets the group coordinator assignment back. These log lines look similar to a previous successful migration, and the memberId changes (from ending with 615b538a0e79 to 4ba4c7f17eea)

       

      [2023-08-04 07:32:44,388] INFO Loaded member MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79, groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer, clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(stream)) in group river.stream-filter.v1 with generation 686. (kafka.coordinator.group.GroupMetadata$)

      ...

      [2023-08-04 07:33:08,824] INFO Loaded member MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-e1dd9fd3-5887-4cd3-96a7-4ba4c7f17eea, groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer, clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(stream)) in group river.stream-filter.v1 with generation 737. (kafka.coordinator.group.GroupMetadata$)

      I've attached more complete INFO kafka logs from the two group coordinators and a kstream app host.

      Attachments

        1. broker-55002-kafka.log
          2.53 MB
          Carrie Hazelton
        2. broker-50003-kafka.log
          6.73 MB
          Carrie Hazelton
        3. kstreams-55002-kafka.log
          925 kB
          Carrie Hazelton

        Activity

          People

            Unassigned Unassigned
            chazel Carrie Hazelton
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: