KAFKA-9212: Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1
    • Fix Version/s: 2.4.0, 2.3.2
    • Component/s: consumer, offset manager
    • Labels: None
    • Environment: Linux

    Description

      When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one broker was restarted (the leaderEpoch was updated at this point), the Connect worker crashed with the following error:

      [2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
      org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30003ms

       

      After investigation, it appears the worker got fenced while sending ListOffsetRequest in a loop and eventually timed out, as follows:

      [2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

      [2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. (org.apache.kafka.clients.consumer.internals.Fetcher:985)

       

      The above repeats many times until the timeout is reached.
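
      For reference, the same offset lookup can be triggered directly through the public consumer API. Below is a minimal sketch (the bootstrap address is a placeholder, like in the CLI examples further down, and the class name is ours); assuming the partition keeps answering FENCED_LEADER_EPOCH, the call fails with the same ~30s "Failed to get offsets by times" TimeoutException seen in the herder log:

      import java.time.Duration;
      import java.util.Collections;
      import java.util.Map;
      import java.util.Properties;

      import org.apache.kafka.clients.consumer.KafkaConsumer;
      import org.apache.kafka.common.TopicPartition;
      import org.apache.kafka.common.serialization.ByteArrayDeserializer;

      public class ListOffsetRepro {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("bootstrap.servers", "BOOTSTRAPSERVER:9092"); // placeholder
              props.put("key.deserializer", ByteArrayDeserializer.class.getName());
              props.put("value.deserializer", ByteArrayDeserializer.class.getName());

              TopicPartition tp = new TopicPartition("connect_ls_config", 0);
              try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                  // beginningOffsets() issues a ListOffsetRequest under the hood; while the
                  // cached leaderEpoch stays behind the broker's, every attempt is fenced
                  // and retried until the timeout below expires.
                  Map<TopicPartition, Long> earliest =
                      consumer.beginningOffsets(Collections.singleton(tp), Duration.ofSeconds(30));
                  System.out.println("Earliest offset: " + earliest.get(tp));
              }
          }
      }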

       

      According to the debug logs, the consumer always gets a leaderEpoch of 1 for this topic when starting up:

       
      [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, groupId=connect-ls] Updating last seen epoch from null to 1 for partition connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
       
       
      But according to our broker logs, the leaderEpoch should be 2, as follows:
       
      [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader Epoch was: 1 (kafka.cluster.Partition)
       
       
      This makes it impossible to restart the worker, as it always gets fenced and eventually times out.
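
      To make the failure mode concrete, here is a rough conceptual model (our own sketch, not the actual Fetcher code) of why the retry loop can only end in a timeout as long as the cached epoch stays behind the broker's:

      public class FencedEpochLoopModel {
          public static void main(String[] args) {
              int cachedLeaderEpoch = 1;   // what the consumer keeps seeing in metadata
              int brokerLeaderEpoch = 2;   // the partition's actual leader epoch
              long deadlineMs = System.currentTimeMillis() + 30_000; // mirrors the ~30s API timeout

              while (System.currentTimeMillis() < deadlineMs) {
                  if (cachedLeaderEpoch >= brokerLeaderEpoch) {
                      System.out.println("ListOffset succeeds, worker can start");
                      return;
                  }
                  // FENCED_LEADER_EPOCH: the client retries, but every metadata refresh
                  // keeps returning epoch 1, so the cached value never catches up.
              }
              System.out.println("Failed to get offsets by times (timeout)");
          }
      }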
       
      It is also impossible to consume with a 2.3 kafka-console-consumer, as follows:
       
      kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic connect_ls_config --from-beginning 
       
      The above just hangs forever (which is not expected, since the topic has data), and we can see these debug messages:

      [2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1, groupId=console-consumer-3844] Attempt to fetch offsets for partition connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. (org.apache.kafka.clients.consumer.internals.Fetcher)
       
       
      Interestingly, if we subscribe to the same topic with kafkacat (1.5.0), we can consume without any problem (it must be the way kafkacat consumes, ignoring FENCED_LEADER_EPOCH):
       
      kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
       
       


          People

            Assignee: hachikuji (Jason Gustafson)
            Reporter: Lambruschi Yannick
