Kafka / KAFKA-13126

Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0
    • Component/s: consumer
    • Labels: None

    Description

      In older versions of Kafka Streams, the max.poll.interval.ms config was overridden by default to Integer.MAX_VALUE. Even after we removed this override, users of both the plain consumer client and Kafka Streams still fairly often set the poll interval to Integer.MAX_VALUE. Unfortunately, adding the JOIN_GROUP_TIMEOUT_LAPSE to that value overflows when computing joinGroupTimeoutMs, so the timeout ends up set to request.timeout.ms instead, which is much lower.
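
      The overflow can be reproduced in isolation. The following is a minimal sketch rather than the actual coordinator code: the class and method names are hypothetical, and it assumes the timeout is computed as the larger of request.timeout.ms and max.poll.interval.ms plus the 5s lapse, using plain int arithmetic.

```java
final class JoinGroupTimeoutOverflow {
    // Assumed from the description: the lapse added on top of the poll interval is 5s.
    static final int JOIN_GROUP_TIMEOUT_LAPSE = 5000;

    // Hypothetical stand-in for the pre-fix computation: the int addition
    // wraps around silently when maxPollIntervalMs is close to Integer.MAX_VALUE.
    static int naiveJoinGroupTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        return Math.max(requestTimeoutMs, maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE);
    }

    public static void main(String[] args) {
        int requestTimeoutMs = 30_000; // 30s, the default mentioned in this ticket

        // Ordinary poll interval: the sum is used as the join timeout.
        System.out.println(naiveJoinGroupTimeoutMs(requestTimeoutMs, 300_000)); // prints 305000

        // Integer.MAX_VALUE: the sum wraps to a negative number,
        // so Math.max falls back to the request timeout.
        System.out.println(naiveJoinGroupTimeoutMs(requestTimeoutMs, Integer.MAX_VALUE)); // prints 30000
    }
}
```

      Java's int addition does not throw on overflow, so nothing signals that the much shorter timeout has been selected.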

      This can easily make consumers drop out of the group: they must now rejoin within 30s (by default), yet with such a high max.poll.interval.ms they are almost never obligated to call poll() – in practice they only do so after processing the last record of the previously polled batch. So in heavy processing cases, where each record takes a long time to process, or when using a very large max.poll.records, it can be difficult to make any progress at all before dropping out and needing to rejoin. Worse, the rebalance kicked off when this member rejoins can cause many of the other members in the group to drop out as well, leading to an endless cycle of missed rebalances.

      We just need to check for overflow and clamp the value to Integer.MAX_VALUE when it occurs. Until then, the workaround is to set max.poll.interval.ms to Integer.MAX_VALUE - 5000 (5s is the JOIN_GROUP_TIMEOUT_LAPSE).
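
      One way to express both the proposed check and the workaround is sketched below. This is an illustration under assumptions, not the patch that was merged: the class and method names are hypothetical, and the actual fix may detect the overflow differently. Only the max.poll.interval.ms key and the 5s lapse come from this ticket.

```java
import java.util.Properties;

final class JoinGroupTimeoutFix {
    // Assumed from the description: the lapse added on top of the poll interval is 5s.
    static final int JOIN_GROUP_TIMEOUT_LAPSE = 5000;

    // Hypothetical overflow-safe variant: widen to long so the sum cannot wrap,
    // then clamp the result to Integer.MAX_VALUE before comparing.
    static int safeJoinGroupTimeoutMs(int requestTimeoutMs, int maxPollIntervalMs) {
        long withLapse = (long) maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE;
        int clamped = (int) Math.min(withLapse, Integer.MAX_VALUE);
        return Math.max(requestTimeoutMs, clamped);
    }

    // Workaround for affected versions: leave room for the lapse so that
    // the plain int addition stays within range.
    static Properties workaroundConfig() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms",
                String.valueOf(Integer.MAX_VALUE - JOIN_GROUP_TIMEOUT_LAPSE));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(safeJoinGroupTimeoutMs(30_000, Integer.MAX_VALUE)); // prints 2147483647
        System.out.println(workaroundConfig().getProperty("max.poll.interval.ms")); // prints 2147478647
    }
}
```

      With the workaround value, the sum of the poll interval and the lapse is exactly Integer.MAX_VALUE, so even the unfixed int addition does not wrap.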

          People

            Assignee: A. Sophie Blee-Goldman (ableegoldman)
            Reporter: A. Sophie Blee-Goldman (ableegoldman)