Description
We have seen several occasions where shifting partitions in a Kafka cluster results in some Samza containers getting stuck with:
2014-10-22 15:10:48 BrokerProxy [INFO] Creating new SimpleConsumer for host eat1-app582.corp:10251 for system kafka
2014-10-22 15:10:48 BrokerProxy [WARN] Got non-recoverable error codes during multifetch. Throwing an exception to trigger reconnect. Errors: Error([all-service-call-events,10],3,kafka.common.UnknownTopicOrPartitionException)
2014-10-22 15:10:48 BrokerProxy [WARN] Restarting consumer due to kafka.common.UnknownTopicOrPartitionException. Turn on debugging to get a full stack trace.
2014-10-22 15:10:58 BrokerProxy [INFO] Creating new SimpleConsumer for host eat1-app582.corp:10251 for system kafka
2014-10-22 15:10:58 BrokerProxy [WARN] Got non-recoverable error codes during multifetch. Throwing an exception to trigger reconnect. Errors: Error([all-service-call-events,10],3,kafka.common.UnknownTopicOrPartitionException)
2014-10-22 15:10:58 BrokerProxy [WARN] Restarting consumer due to kafka.common.UnknownTopicOrPartitionException. Turn on debugging to get a full stack trace.
2014-10-22 15:11:08 BrokerProxy [INFO] Creating new SimpleConsumer for host eat1-app582.corp:10251 for system kafka
The problem appears to be a misunderstanding of how Kafka works. If a partition is moved to another broker and the BrokerProxy continues fetching from the old broker, it will throw an UnknownTopicOrPartitionException and then try to reconnect to the same broker. It will do this indefinitely, retrying every 10 seconds in the log above. Instead, the BrokerProxy should abdicate the TopicAndPartition and allow the new broker to pick it up.
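The proposed behavior can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Samza's actual code: FakeBroker, ProxySketch, ConsumerSketch, and all of their methods are hypothetical names invented for illustration. The only detail taken from the report is error code 3, which the log pairs with UnknownTopicOrPartitionException.

```python
from dataclasses import dataclass, field

UNKNOWN_TOPIC_OR_PARTITION = 3  # error code shown in the log for this exception


@dataclass
class FakeBroker:
    """Hypothetical stand-in for a Kafka broker that hosts some partitions."""
    name: str
    partitions: set = field(default_factory=set)

    def fetch(self, tp):
        # Return 0 on success, or error code 3 if the partition is not hosted here.
        return 0 if tp in self.partitions else UNKNOWN_TOPIC_OR_PARTITION


class ProxySketch:
    """Hypothetical model of a per-broker proxy that owns a set of partitions."""

    def __init__(self, broker, abdicate):
        self.broker = broker
        self.abdicate = abdicate  # callback that hands a partition back
        self.owned = set()

    def fetch_once(self):
        # sorted() copies the set, so removing entries inside the loop is safe.
        for tp in sorted(self.owned):
            if self.broker.fetch(tp) == UNKNOWN_TOPIC_OR_PARTITION:
                # Give the partition up instead of reconnecting to the same broker.
                self.owned.discard(tp)
                self.abdicate(tp)


class ConsumerSketch:
    """Hypothetical owner of all proxies; reassigns abdicated partitions."""

    def __init__(self, brokers):
        self.proxies = {b.name: ProxySketch(b, self.reassign) for b in brokers}

    def leader_for(self, tp):
        # Stand-in for a metadata refresh: find the broker currently hosting tp.
        for proxy in self.proxies.values():
            if tp in proxy.broker.partitions:
                return proxy
        return None

    def reassign(self, tp):
        proxy = self.leader_for(tp)
        if proxy is not None:
            proxy.owned.add(tp)


tp = ("all-service-call-events", 10)
old, new = FakeBroker("old", {tp}), FakeBroker("new")
consumer = ConsumerSketch([old, new])
consumer.reassign(tp)             # initial assignment lands on "old"

old.partitions.discard(tp)        # the partition is moved to another broker
new.partitions.add(tp)

consumer.proxies["old"].fetch_once()  # error code 3, so "old" abdicates
print(sorted(consumer.proxies["new"].owned))
```

In this sketch a partition with no current leader is simply dropped by reassign; a real implementation would presumably need to retry the lookup later rather than lose the partition.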