The new consumer has a lot of loops that look something like
This pattern occurs both in KafkaConsumer and in NetworkClient.completeAll. These retry loops are mostly the behavior we want, but there are cases where they may cause problems:
- In the case of a hard failure we may hang for a long time or indefinitely before realizing the connection is lost.
- In the case where the cluster is malfunctioning or down we may retry forever.
It would probably be better to give these loops a timeout. The proposed approach is to add a setting such as retry.time.ms=60000 and to continue retrying only for that period of time.
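
To make the proposal concrete, here is a minimal sketch of a deadline-bounded retry loop. It is not the consumer's actual code: the class BoundedRetry, the method pollUntil, and the isComplete and pollOnce callbacks are hypothetical names used only for illustration, and RETRY_TIME_MS stands in for the proposed retry.time.ms setting, which does not exist yet.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of the Kafka client.
final class BoundedRetry {
    // Stand-in for the proposed retry.time.ms=60000 setting (assumption).
    static final long RETRY_TIME_MS = 60_000L;

    /**
     * Calls pollOnce repeatedly until isComplete returns true or
     * retryTimeMs has elapsed since the call started.
     *
     * @return true if the condition was met, false if the deadline passed first
     */
    static boolean pollUntil(BooleanSupplier isComplete, Runnable pollOnce, long retryTimeMs) {
        // nanoTime is monotonic, so wall-clock adjustments do not affect the deadline.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(retryTimeMs);
        while (!isComplete.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of retrying forever
            }
            pollOnce.run();
        }
        return true;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Simulated operation that completes after three polls.
        boolean ok = pollUntil(() -> polls[0] >= 3, () -> polls[0]++, RETRY_TIME_MS);
        System.out.println("completed=" + ok + " polls=" + polls[0]);

        // Simulated dead cluster: the condition never becomes true.
        boolean stuck = pollUntil(() -> false, () -> { }, 50L);
        System.out.println("completed=" + stuck);
    }
}
```

Returning false on timeout is only one option. The caller could equally throw an exception at that point so that a lost connection or a cluster that is down surfaces as an error rather than as a hang.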