Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-8052

Intermittent INVALID_FETCH_SESSION_EPOCH error on FETCH request

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.0
    • 2.3.0
    • clients
    • None

    Description

      I noticed in my logs some weird behavior. I see in logs intermittent log: 

      2019-03-06 14:02:13.024 INFO 1 --- [container-1-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-4, groupId=service-main] Node 2 was unable to process the fetch request with (sessionId=1321134604, epoch=125730): INVALID_FETCH_SESSION_EPOCH.

      which happens every ~1 hour. 

       

      I was wondering if it's my Kafka provider fault so I decided to investigate the problem and I tried to reproduce the issue on my local - with success. My configuration is:

      • Kafka Clients version - 2.0.1
      • Kafka - 2.12_2.1.0

       

      I enabled trace logs for 'org.apache.kafka.clients' and that's what I get:

      2019-03-05 21:04:16.161 DEBUG 3052 --- [container-0-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main] Built incremental fetch (sessionId=197970881, epoch=525) for node 1001. Added (), altered (), removed () out of (itunes-command-19, itunes-command-18, itunes-command-11, itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)
      2019-03-05 21:04:16.161 DEBUG 3052 --- [container-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-3, groupId=service-main] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(itunes-command-19, itunes-command-18, itunes-command-11, itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)) to broker localhost:9092 (id: 1001 rack: null)
      2019-03-05 21:04:16.161 TRACE 3052 --- [container-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Sending FETCH {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=197970881,epoch=525,topics=[],forgotten_topics_data=[]} with correlation id 629 to node 1001
      2019-03-05 21:04:16.664 TRACE 3052 --- [container-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Completed receive from node 1001 for FETCH with correlation id 629, received {throttle_time_ms=0,error_code=0,session_id=197970881,responses=[]}
      2019-03-05 21:04:16.664 DEBUG 3052 --- [container-0-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main] Node 1001 sent an incremental fetch response for session 197970881 with response=(), implied=(itunes-command-19, itunes-command-18, itunes-command-11, itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)
      
      
      
      2019-03-05 21:04:16.665 DEBUG 3052 --- [container-0-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main] Built incremental fetch (sessionId=197970881, epoch=526) for node 1001. Added (), altered (), removed () out of (itunes-command-19, itunes-command-18, itunes-command-11, itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)
      2019-03-05 21:04:16.665 DEBUG 3052 --- [container-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-3, groupId=service-main] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(itunes-command-19, itunes-command-18, itunes-command-11, itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)) to broker localhost:9092 (id: 1001 rack: null)
      2019-03-05 21:04:16.665 TRACE 3052 --- [container-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main - F630] Sending FETCH {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=197970881,epoch=526,topics=[],forgotten_topics_data=[]} with correlation id 630 to node 1001
      2019-03-05 21:04:17.152 DEBUG 3052 --- [ service-main] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-3, groupId=service-main - ???] Sending Heartbeat request to coordinator localhost:9092 (id: 2147482646 rack: null)
      2019-03-05 21:04:17.153 TRACE 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main - H631] Sending HEARTBEAT {group_id=service-main,generation_id=108,member_id=consumer-3-7e3c718e-f472-4bd8-9404-458ae5ed9a26} with correlation id 631 to node 2147482646
      2019-03-05 21:04:17.254 TRACE 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main - ???] Completed receive from node 2147482646 for HEARTBEAT with correlation id 631, received {throttle_time_ms=0,error_code=0}
      
      
      // After two seconds FETCH response has been received...
      2019-03-05 21:04:19.141 TRACE 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main - F630] Completed receive from node 1001 for FETCH with correlation id 630, received {throttle_time_ms=0,error_code=0,session_id=197970881,responses=[]}
      2019-03-05 21:04:19.142 DEBUG 3052 --- [ service-main] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-3, groupId=service-main] Received successful Heartbeat response
      
      // But in the same moment, Fethcer is sending FETCH with same epoch once again
      2019-03-05 21:04:19.142 DEBUG 3052 --- [container-0-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main] Built incremental fetch (sessionId=197970881, epoch=526) for node 1001. Added (), altered (), removed (itunes-command-19, itunes-command-18, itunes-command-11) out of (itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)
      2019-03-05 21:04:19.142 DEBUG 3052 --- [container-0-C-1] o.a.k.c.consumer.internals.Fetcher : [Consumer clientId=consumer-3, groupId=service-main] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(itunes-command-19, itunes-command-18, itunes-command-11), implied=(itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)) to broker localhost:9092 (id: 1001 rack: null)
      2019-03-05 21:04:19.142 DEBUG 3052 --- [ service-main] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main - F630 ] Node 1001 sent an incremental fetch response for session 197970881 with response=(), implied=(itunes-command-10, itunes-command-13, itunes-command-12, itunes-command-15, itunes-command-14, itunes-command-17, itunes-command-16)
      2019-03-05 21:04:19.142 TRACE 3052 --- [container-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main - F632] Sending FETCH {replica_id=-1,max_wait_time=500,min_bytes=1,max_bytes=52428800,isolation_level=0,session_id=197970881,epoch=526,topics=[],forgotten_topics_data=[{topic=itunes-command,partitions=[19,18,11]}]} with correlation id 632 to node 1001
      2019-03-05 21:04:19.152 TRACE 3052 --- [container-0-C-1] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Completed receive from node 1001 for FETCH with correlation id 632, received {throttle_time_ms=0,error_code=71,session_id=0,responses=[]}
      2019-03-05 21:04:52.546 TRACE 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Found least loaded node localhost:9092 (id: 1001 rack: null)
      2019-03-05 21:04:52.546 DEBUG 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Sending metadata request (type=MetadataRequest, topics=itunes-command) to node localhost:9092 (id: 1001 rack: null)
      2019-03-05 21:04:52.546 TRACE 3052 --- [ service-main] org.apache.kafka.clients.NetworkClient : [Consumer clientId=consumer-3, groupId=service-main] Sending METADATA {topics=[itunes-command],allow_auto_topic_creation=true} with correlation id 633 to node 1001
      2019-03-05 21:04:52.547 INFO 3052 --- [ service-main] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=consumer-3, groupId=service-main] Group coordinator localhost:9092 (id: 2147482646 rack: null) is unavailable or invalid, will attempt rediscovery
      2019-03-05 21:12:04.206 INFO 3052 --- [container-0-C-1] o.a.kafka.clients.FetchSessionHandler : [Consumer clientId=consumer-3, groupId=service-main] Node 1001 was unable to process the fetch request with (sessionId=197970881, epoch=527): INVALID_FETCH_SESSION_EPOCH.

       

      The first request has been sent with epoch 525 and that's ok.

      The second request has been sent with epoch 526, and... in the meantime, AbstractCoordinator starts sending Heartbeat request. Everything stops for ~2 seconds. After these 2 seconds, the response for FETCH request has been received. But in the same moment, Fetcher decides to send FETCH request with the same epoch once again! That results with INVALID_FETCH_SESSION_EPOCH response from Kafka node (which is understandable).

      I'm wondering if is there some workaround for this problem?
      And the second question is - does it impact my service performance?

      Attachments

        Issue Links

          Activity

            People

              rsivaram Rajini Sivaram
              bartekjakub Bartek Jakub
              Jason Gustafson Jason Gustafson
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: