  1. Kafka
  2. KAFKA-9140

Consumer gets stuck rejoining the group indefinitely

Details

    Description

      There seems to be a race condition that can now cause a rejoining member to get stuck initiating a rejoin indefinitely. The relevant client logs are attached (streams.log.tgz; all other attachments are broker logs). The client repeats the following message, and nothing else, continuously until it is killed or shut down:

       

      [2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
      

       

      This message was added as part of the bugfix (PR 7460) for a related race condition, KAFKA-8104.
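      To check whether a client log shows this symptom, one option is to look for a long uninterrupted run of the message quoted above. The following is a minimal sketch under stated assumptions: `RejoinLoopDetector`, `longestRun`, and `looksStuck` are hypothetical names that are not part of Kafka, and the threshold for "stuck" is an arbitrary choice left to the caller.

```java
import java.util.List;

// Hypothetical helper, not part of Kafka: measures how often the rejoin
// message from this report repeats back to back in a client log.
final class RejoinLoopDetector {
    // Message text copied from the log line quoted in this report.
    static final String MARKER =
        "Generation data was cleared by heartbeat thread. Initiating rejoin.";

    /** Returns the length of the longest run of consecutive lines containing MARKER. */
    static int longestRun(List<String> lines) {
        int longest = 0;
        int current = 0;
        for (String line : lines) {
            if (line.contains(MARKER)) {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0; // any other log line breaks the run
            }
        }
        return longest;
    }

    /** True if MARKER repeats at least `threshold` times with nothing in between. */
    static boolean looksStuck(List<String> lines, int threshold) {
        return longestRun(lines) >= threshold;
    }
}
```

      For example, the lines of the extracted client log could be read with `java.nio.file.Files.readAllLines` and passed to `looksStuck` with a threshold chosen by the caller. A healthy rebalance may log the message occasionally, so only a long unbroken run suggests the behavior described here.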

      This issue was uncovered by the Streams version probing upgrade test, which fails intermittently. Here are the failure rates for the system test runs so far:

      trunk (cooperative): 1/1 and 2/10 failures

      2.4 (cooperative): 0/10 and 1/15 failures

      trunk (eager): 0/10 failures
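      The per-batch counts above can be pooled into one failure rate per configuration. This is only a rough sketch: it assumes the batches for a given configuration are comparable enough to be added together, which the report does not state, and `FailureRates`, `pool`, and `pooledFromReport` are hypothetical names used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: sums {failures, runs} pairs from the batches listed above.
final class FailureRates {
    /** Pools {failures, runs} pairs into a single {failures, runs} total. */
    static int[] pool(int[][] batches) {
        int failures = 0;
        int runs = 0;
        for (int[] batch : batches) {
            failures += batch[0];
            runs += batch[1];
        }
        return new int[] {failures, runs};
    }

    /** Pooled totals for the three configurations listed in this report. */
    static Map<String, int[]> pooledFromReport() {
        Map<String, int[]> pooled = new LinkedHashMap<>();
        pooled.put("trunk (cooperative)", pool(new int[][] {{1, 1}, {2, 10}}));
        pooled.put("2.4 (cooperative)", pool(new int[][] {{0, 10}, {1, 15}}));
        pooled.put("trunk (eager)", pool(new int[][] {{0, 10}}));
        return pooled;
    }
}
```

      Pooled this way, the totals are 3/11 for trunk (cooperative), 1/25 for 2.4 (cooperative), and 0/10 for trunk (eager). The sample sizes are small, so these numbers indicate only that the failure is intermittent, not how the configurations compare.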

      I've kicked off some high-repeat runs to complete overnight and hopefully shed more light.

      Note that I have also kicked off runs of both 2.4 and trunk with the PR for KAFKA-8104 reverted. Both of them saw 2/10 failures, caused by the bug that PR 7460 fixed. It is therefore unclear whether PR 7460 introduced a new race condition, or merely uncovered an existing one that was previously masked because the test would have failed on KAFKA-8104 first.

       

      Attachments

        1. streams.log.tgz
          3.22 MB
          A. Sophie Blee-Goldman
        2. info.tgz
          8 kB
          A. Sophie Blee-Goldman
        3. kafka-data-logs-1.tgz
          387 kB
          A. Sophie Blee-Goldman
        4. kafka-data-logs-2.tgz
          456 kB
          A. Sophie Blee-Goldman
        5. server-start-stdout-stderr.log.tgz
          1.97 MB
          A. Sophie Blee-Goldman
        6. debug.tgz
          5.23 MB
          A. Sophie Blee-Goldman

            People

              guozhang Guozhang Wang
              ableegoldman A. Sophie Blee-Goldman