Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-8477

Cannot consume and request metadata for __consumer_offsets topic in Kafka 2.2

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.2.0
    • None
    • clients
    • None

    Description

      We have an application that consumes from the __consumer_offsets topic to report lag metrics. When we upgraded its kafka-clients dependency from 2.0.1 to 2.2.0, it crashed with:

      Exception in thread "main" org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30001ms
      

      I created a minimal reproduction at https://github.com/mikemintz/kafka-2.2.0-consumer-offset-metadata-bug and I'm uploading a zip of this code for posterity.

      In particular, the behavior happens when I call KafkaConsumer.assign(), then poll(), then endOffsets(). This behavior only happens for the __consumer_offsets topic. It also only happens on the Kafka cluster that we run in production, which runs Kafka 2.2.0. The error does not occur on a freshly created Kafka cluster, and I can't get it to reproduce with EmbeddedKafka.

      It works fine with both Kafka 2.0.1 and Kafka 2.1.1, so something broke between 2.1.1. and 2.2.0. Based on the 2.2.0 changelog and the client log messages (attached), it looks like it may have been introduced in KAFKA-7738 (cc mumrah). It gets in a loop, repeating the following block of log messages:

      2019-06-03 23:24:15 DEBUG NetworkClient:1073 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Sending metadata request (type=MetadataRequest, topics=__consumer_offsets) to node REDACTED:9094 (id: 2134 rack: us-west-2b)
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5862 to 5862 for partition __consumer_offsets-0
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6040 to 6040 for partition __consumer_offsets-10
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6008 to 6008 for partition __consumer_offsets-20
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6153 with new epoch 6152
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5652 to 5652 for partition __consumer_offsets-30
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6081 to 6081 for partition __consumer_offsets-39
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5629 to 5629 for partition __consumer_offsets-9
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5983 to 5983 for partition __consumer_offsets-11
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 to 5896 for partition __consumer_offsets-31
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5278 to 5278 for partition __consumer_offsets-13
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6026 to 6026 for partition __consumer_offsets-18
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5608 to 5608 for partition __consumer_offsets-22
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6025 to 6025 for partition __consumer_offsets-32
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5685 to 5685 for partition __consumer_offsets-8
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5730 to 5730 for partition __consumer_offsets-43
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5957 with new epoch 5956
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6047 with new epoch 6046
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6090 to 6090 for partition __consumer_offsets-1
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5821 to 5821 for partition __consumer_offsets-6
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5909 to 5909 for partition __consumer_offsets-41
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5805 to 5805 for partition __consumer_offsets-27
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5670 to 5670 for partition __consumer_offsets-48
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6220 to 6220 for partition __consumer_offsets-5
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5596 to 5596 for partition __consumer_offsets-15
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 to 5896 for partition __consumer_offsets-35
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5858 to 5858 for partition __consumer_offsets-25
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5927 with new epoch 5926
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5938 with new epoch 5937
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5792 to 5792 for partition __consumer_offsets-36
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5984 to 5984 for partition __consumer_offsets-44
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5932 with new epoch 5931
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5979 to 5979 for partition __consumer_offsets-16
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5804 to 5804 for partition __consumer_offsets-17
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6179 to 6179 for partition __consumer_offsets-3
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5953 to 5953 for partition __consumer_offsets-45
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6276 to 6276 for partition __consumer_offsets-24
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6119 to 6119 for partition __consumer_offsets-38
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5784 to 5784 for partition __consumer_offsets-33
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5665 to 5665 for partition __consumer_offsets-23
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5812 to 5812 for partition __consumer_offsets-28
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6014 to 6014 for partition __consumer_offsets-2
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6199 to 6199 for partition __consumer_offsets-12
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5908 to 5908 for partition __consumer_offsets-19
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5697 to 5697 for partition __consumer_offsets-14
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5673 to 5673 for partition __consumer_offsets-4
      2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5921 with new epoch 5920
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5940 to 5940 for partition __consumer_offsets-49
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5836 to 5836 for partition __consumer_offsets-42
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5827 to 5827 for partition __consumer_offsets-7
      2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5670 to 5670 for partition __consumer_offsets-21
      2019-06-03 23:24:15 DEBUG Metadata:370 - Updated cluster metadata version 293 to MetadataCache{cluster=Cluster(id = 5Cs2i3WbTzKHbrF2GzLwaQ, nodes = [REDACTED:9094 (id: 2132 rack: us-west-2a), REDACTED:9094 (id: 2133 rack: us-west-2a), REDACTED:9094 (id: 2131 rack: us-west-2c), REDACTED:9094 (id: 2136 rack: us-west-2b), REDACTED:9094 (id: 2135 rack: us-west-2c), REDACTED:9094 (id: 2134 rack: us-west-2b), REDACTED:9094 (id: 2137 rack: us-west-2c)], partitions = [Partition(topic = __consumer_offsets, partition = 13, leader = 2136, replicas = [2136,2131,2132], isr = [2132,2136,2131], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 46, leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 9, leader = 2132, replicas = [2132,2134,2131], isr = [2132,2134,2131], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 42, leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 21, leader = 2136, replicas = [2136,2131,2133], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 17, leader = 2133, replicas = [2133,2136,2131], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 30, leader = 2136, replicas = [2136,2133,2131], isr = [2136,2133,2131], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 26, leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 5, leader = 2136, replicas = [2136,2131,2132], isr = [2132,2131,2136], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 38, leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 1, leader = 2132, replicas = [2132,2134,2131], isr = [2132,2134,2131], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 34, leader = 2132, replicas = [2132,2136,2135], isr = [2135,2132,2136], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 16, leader = 2134, replicas = [2134,2132,2135], isr = [2132,2134,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 45, leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 12, leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 41, leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 24, leader = 2136, replicas = [2136,2133,2131], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 20, leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 49, leader = 2136, replicas = [2136,2132,2135], isr = [2135,2132,2136], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 0, leader = 2134, replicas = [2134,2132,2135], isr = [2132,2134,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 29, leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 25, leader = 2134, replicas = [2134,2135,2133], isr = [2134,2133,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 8, leader = 2133, replicas = [2133,2136,2131], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 37, leader = 2133, replicas = [2133,2136,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 4, leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 33, leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 15, leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 48, leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 11, leader = 2132, replicas = [2132,2135,2136], isr = [2132,2136,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 44, leader = 2133, replicas = [2133,2134,2135], isr = [2135,2134,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 23, leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 19, leader = 2136, replicas = [2136,2133,2131], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 32, leader = 2136, replicas = [2136,2131,2133], isr = [2131,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 28, leader = 2136, replicas = [2136,2133,2135], isr = [2135,2136,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 7, leader = 2136, replicas = [2136,2132,2131], isr = [2132,2136,2131], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 40, leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 3, leader = 2136, replicas = [2136,2133,2135], isr = [2136,2133,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 36, leader = 2134, replicas = [2134,2133,2135], isr = [2135,2134,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 47, leader = 2132, replicas = [2132,2134,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 14, leader = 2134, replicas = [2134,2135,2132], isr = [2132,2134,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 43, leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 10, leader = 2132, replicas = [2132,2136,2135], isr = [2132,2136,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 22, leader = 2134, replicas = [2134,2135,2133], isr = [2134,2133,2135], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 18, leader = 2136, replicas = [2136,2131,2132], isr = [2132,2131,2136], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 31, leader = 2133, replicas = [2133,2134,2135], isr = [2135,2134,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 27, leader = 2134, replicas = [2134,2133,2131], isr = [2134,2131,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 39, leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 6, leader = 2133, replicas = [2133,2134,2131], isr = [2134,2131,2133], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 35, leader = 2134, replicas = [2134,2132,2135], isr = [2135,2132,2134], offlineReplicas = []), Partition(topic = __consumer_offsets, partition = 2, leader = 2134, replicas = [2134,2135,2132], isr = [2132,2134,2135], offlineReplicas = [])], controller = REDACTED:9094 (id: 2132 rack: us-west-2a))}
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-46 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-26 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-34 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-29 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-37 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-40 is unknown for fetching offset
      2019-06-03 23:24:15 DEBUG Fetcher:700 - [Consumer clientId=test.mikemintz.lag-tracker-reproduce, groupId=test.mikemintz.lag-tracker-reproduce] Leader for partition __consumer_offsets-47 is unknown for fetching offset
      

      Attachments

        1. kafka-2.2.0-consumer-offset-metadata-bug-master.zip
          4 kB
          Mike Mintz
        2. logs.txt
          310 kB
          Mike Mintz

        Activity

          People

            mumrah David Arthur
            mikemintz Mike Mintz
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: