Description
We have recently seen cases where, after a roll, brokers end up in a bad state in which fetch session evictions occur at a high rate (> 16 per second). The increased eviction rate was accompanied by the following pattern in our logs:
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental FetchContext for session id 2046264334, epoch 9790: added (), updated (), removed ()
broker 6: October 31st 2019, 17:52:45.496 Created a new incremental FetchContext for session id 2046264334, epoch 9791: added (), updated (), removed ()
broker 6: October 31st 2019, 17:52:45.500 Created a new incremental FetchContext for session id 2046264334, epoch 9792: added (), updated (lkc-7nv6o_tenant_soak_topic_144p-67), removed ()
broker 6: October 31st 2019, 17:52:45.501 Created a new incremental FetchContext for session id 2046264334, epoch 9793: added (), updated (lkc-7nv6o_tenant_soak_topic_144p-59, lkc-7nv6o_tenant_soak_topic_144p-123, lkc-7nv6o_tenant_soak_topic_144p-11, lkc-7nv6o_tenant_soak_topic_144p-3, lkc-7nv6o_tenant_soak_topic_144p-67, lkc-7nv6o_tenant_soak_topic_144p-115), removed ()
broker 6: October 31st 2019, 17:52:45.501 Evicting stale FetchSession 2046264334.
broker 6: October 31st 2019, 17:52:45.502 Session error for 2046264334: no such session ID found.
broker 4: October 31st 2019, 17:52:45.813 [ReplicaFetcher replicaId=4, leaderId=6, fetcherId=0] Node 6 was unable to process the fetch request with (sessionId=2046264334, epoch=9793): FETCH_SESSION_ID_NOT_FOUND.
This pattern appears to be problematic for two reasons. Firstly, the replica fetcher for broker 4 was clearly able to send multiple incremental fetch requests to broker 6 and receive replies, yet broker 6 evicted its fetch session as stale within milliseconds of the most recent of those requests. Secondly, replica fetchers are considered privileged for the fetch session cache, and should not be evicted by consumer fetch sessions. This cluster only has 12 brokers and 1000 fetch session cache slots (the default for max.incremental.fetch.session.cache.slots), and it is thus very unlikely that this session should have been evicted by another replica fetcher session.
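The expectation described above can be illustrated with a small model. The following Python sketch is not the broker's code: the class and method names are hypothetical, the 120000 ms staleness threshold is assumed to match the default of min.incremental.fetch.session.eviction.ms, and the real cache also takes session size into account, which is ignored here. It only shows the two rules the argument relies on: a session counts as stale only after a long idle period, and a consumer session cannot displace a non-stale replica fetcher session.

```python
from dataclasses import dataclass

# Assumed staleness threshold in milliseconds (believed to be the broker
# default for min.incremental.fetch.session.eviction.ms).
EVICTION_MS = 120_000


@dataclass
class Session:
    session_id: int
    privileged: bool   # True for replica fetchers, False for consumers
    last_used_ms: int


class SessionCache:
    """Simplified, hypothetical model of a fetch session cache."""

    def __init__(self, max_slots, eviction_ms=EVICTION_MS):
        self.max_slots = max_slots
        self.eviction_ms = eviction_ms
        self.sessions = {}

    def touch(self, session_id, now_ms):
        # Record that the session was just used by a fetch request.
        self.sessions[session_id].last_used_ms = now_ms

    def try_add(self, new, now_ms):
        # Free slot available: no eviction needed.
        if len(self.sessions) < self.max_slots:
            self.sessions[new.session_id] = new
            return True
        victim = self._pick_victim(new, now_ms)
        if victim is None:
            return False
        del self.sessions[victim.session_id]
        self.sessions[new.session_id] = new
        return True

    def _pick_victim(self, new, now_ms):
        # Rule 1: the least recently used session may be evicted by anyone,
        # but only if it has been idle for longer than the threshold.
        oldest = min(self.sessions.values(), key=lambda s: s.last_used_ms)
        if now_ms - oldest.last_used_ms > self.eviction_ms:
            return oldest
        # Rule 2: otherwise an unprivileged newcomer may only displace
        # unprivileged sessions; a privileged newcomer may displace any.
        candidates = [
            s for s in self.sessions.values() if new.privileged or not s.privileged
        ]
        if not candidates:
            return None
        return min(candidates, key=lambda s: s.last_used_ms)
```

Under this model, a replica fetcher session that was used a millisecond ago is neither stale nor evictable by a consumer, which is why the "Evicting stale FetchSession" line in the logs above looks wrong.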
This cluster also appears to go through repeated cycles of fetch session evictions, never stabilizing into a state where fetch sessions are not evicted. The above logs are the best example I could find of a case where a session clearly should not have been evicted.
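To look for further cases like the one above, a script along the following lines could scan exported broker logs for sessions that were evicted as stale shortly after being used. This is only a sketch: it assumes the lines have the same "broker N: <date>, <time> <message>" shape as the excerpts quoted in this description, and the function name and the one-second max_idle threshold are hypothetical choices.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the excerpts quoted in this ticket.
LINE_RE = re.compile(
    r"^broker (\d+): (\w+) (\d+)(?:st|nd|rd|th) (\d{4}), "
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) (.*)$"
)
USED_RE = re.compile(
    r"Created a new incremental FetchContext for session id (\d+),"
)
EVICTED_RE = re.compile(r"Evicting stale FetchSession (\d+)\.")


def suspicious_evictions(lines, max_idle=timedelta(seconds=1)):
    """Return (broker, session_id, idle_time) for every session that was
    evicted as stale within max_idle of its last incremental fetch."""
    last_used = {}
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed format
        broker, month, day, year, clock, message = m.groups()
        ts = datetime.strptime(
            f"{month} {day} {year} {clock}", "%B %d %Y %H:%M:%S.%f"
        )
        used = USED_RE.search(message)
        evicted = EVICTED_RE.search(message)
        if used:
            last_used[(broker, used.group(1))] = ts
        elif evicted:
            key = (broker, evicted.group(1))
            if key in last_used and ts - last_used[key] <= max_idle:
                hits.append((int(broker), int(evicted.group(1)), ts - last_used[key]))
    return hits


# Two of the lines quoted in this ticket, used as sample input.
SAMPLE = [
    "broker 6: October 31st 2019, 17:52:45.501 Created a new incremental "
    "FetchContext for session id 2046264334, epoch 9793: added (), updated ("
    "lkc-7nv6o_tenant_soak_topic_144p-59, lkc-7nv6o_tenant_soak_topic_144p-123, "
    "lkc-7nv6o_tenant_soak_topic_144p-11, lkc-7nv6o_tenant_soak_topic_144p-3, "
    "lkc-7nv6o_tenant_soak_topic_144p-67, lkc-7nv6o_tenant_soak_topic_144p-115), "
    "removed ()",
    "broker 6: October 31st 2019, 17:52:45.501 Evicting stale FetchSession 2046264334.",
]

for broker, session_id, idle in suspicious_evictions(SAMPLE):
    print(f"broker {broker}: session {session_id} evicted after {idle} idle")
```

The month name is parsed with %B, so this assumes an English locale; log lines in another format would need a different LINE_RE.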