Affects Version/s: 0.8.0
Fix Version/s: None
I recently ran into a case where a poorly configured Kafka consumer was able to trigger out of memory exceptions in multiple Kafka brokers. The consumer was configured to have a fetcher.max.wait of Int.MaxValue.
For low-volume topics, this configuration causes the consumer to block frequently, and for long periods of time. Jun Rao informs me that the fetch request will time out after the socket timeout is reached. In our case, this was set to 30s.
With several thousand consumer threads, the fetch request purgatory grew into the 100,000-400,000 request range, which we believe triggered the out of memory exception. Neha Narkhede claims to have seen similar behavior in other high-volume clusters.
It seems like a bad thing that a poorly configured consumer can trigger out of memory exceptions in the broker. I think it makes sense to have the broker try to protect itself from this situation. Here are some potential solutions:
1. Have a broker-side max wait config for fetch requests.
2. Threshold the purgatory size, and either drop the oldest connections in purgatory, or reject the newest fetch requests when purgatory is full.
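
To make the two options concrete, here is a rough sketch of how they might fit together. It is not how the broker's purgatory is actually implemented; the class `BoundedFetchPurgatory`, the `OverflowPolicy` enum, and the `brokerMaxWaitMs` and `maxSize` settings are all hypothetical names made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only; not Kafka's actual purgatory code. */
final class BoundedFetchPurgatory {

    /** A parked fetch request. The fields are illustrative. */
    public static final class DelayedFetch {
        public final String clientId;
        public final long maxWaitMs;

        DelayedFetch(String clientId, long maxWaitMs) {
            this.clientId = clientId;
            this.maxWaitMs = maxWaitMs;
        }
    }

    /** What to do when the purgatory is full (option 2). */
    public enum OverflowPolicy { DROP_OLDEST, REJECT_NEWEST }

    private final long brokerMaxWaitMs;   // assumed broker-side ceiling (option 1)
    private final int maxSize;            // assumed purgatory size threshold (option 2)
    private final OverflowPolicy policy;
    private final Deque<DelayedFetch> pending = new ArrayDeque<>();

    public BoundedFetchPurgatory(long brokerMaxWaitMs, int maxSize, OverflowPolicy policy) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        this.brokerMaxWaitMs = brokerMaxWaitMs;
        this.maxSize = maxSize;
        this.policy = policy;
    }

    /** Option 1: clamp the wait the consumer asked for to the broker-side ceiling. */
    public long effectiveWaitMs(long requestedWaitMs) {
        return Math.max(0L, Math.min(requestedWaitMs, brokerMaxWaitMs));
    }

    /**
     * Option 2: park a fetch request, enforcing the size threshold.
     * Returns false if the request was rejected because the purgatory is full.
     */
    public synchronized boolean tryAdd(String clientId, long requestedWaitMs) {
        if (pending.size() >= maxSize) {
            if (policy == OverflowPolicy.REJECT_NEWEST) {
                return false;
            }
            // DROP_OLDEST: evict the longest-waiting request. A real broker
            // would presumably answer it with an empty response at this point.
            pending.pollFirst();
        }
        pending.addLast(new DelayedFetch(clientId, effectiveWaitMs(requestedWaitMs)));
        return true;
    }

    public synchronized int size() {
        return pending.size();
    }

    public synchronized DelayedFetch oldest() {
        return pending.peekFirst();
    }
}
```

With a ceiling of, say, 500 ms, a consumer asking for a wait of `Integer.MAX_VALUE` would be parked for at most 500 ms, and with either overflow policy the number of parked requests never exceeds `maxSize`, so purgatory memory stays bounded regardless of how consumers are configured.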