During some of the KIP-572 work, we made things pretty brittle by changing the StreamsPartitionAssignor to send the `INCOMPLETE_SOURCE_TOPIC_METADATA` error code and shut down the entire application if a TimeoutException is hit during internal topic creation/validation.
Internal topic validation occurs during every rebalance, and we have seen it time out on topic discovery in unstable environments. So shutting down the entire application seems like a step in the wrong direction, and antithetical to the goal of KIP-572 (improving the resiliency of Streams in the face of TimeoutExceptions).
I'm not totally sure what the previous behavior was, but it seems to me we have three options:
- Rethrow the TimeoutException and allow it to kill the thread
- Swallow the TimeoutException and retry the rebalance indefinitely
- Some combination of the above: swallow the TimeoutException but don't retry indefinitely. Either:
  - Start a timer and allow retrying rebalances for up to the configured task.timeout.ms, the timeout config introduced in KIP-572
  - Retry for some constant number of rebalances
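To make the timer variant of option 3 a bit more concrete, here is a rough sketch of the bookkeeping it would need. This is only an illustration: `RebalanceRetryTracker`, `Decision`, `onTimeout` and `onSuccess` are hypothetical names, not existing Streams code, and I'm assuming the assignor would call `onTimeout()` wherever it currently catches the TimeoutException and `onSuccess()` once internal topic validation passes. The clock is injected only so the logic can be tested without sleeping.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Kafka Streams): bounds how long we keep
 * retrying rebalances after internal topic validation starts timing out.
 */
final class RebalanceRetryTracker {

    /** What the caller should do after a TimeoutException. */
    enum Decision { RETRY_REBALANCE, GIVE_UP }

    private final long timeoutMs;       // e.g. the value of task.timeout.ms
    private final LongSupplier clockMs; // current time in milliseconds
    private boolean timerRunning = false;
    private long firstFailureMs = 0L;

    RebalanceRetryTracker(long timeoutMs, LongSupplier clockMs) {
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeoutMs must be >= 0");
        }
        this.timeoutMs = timeoutMs;
        this.clockMs = clockMs;
    }

    /**
     * Record a timed-out validation attempt. The timer starts on the first
     * failure in a streak; once timeoutMs has elapsed since then, give up.
     */
    Decision onTimeout() {
        long now = clockMs.getAsLong();
        if (!timerRunning) {
            timerRunning = true;
            firstFailureMs = now;
        }
        return (now - firstFailureMs < timeoutMs)
                ? Decision.RETRY_REBALANCE
                : Decision.GIVE_UP;
    }

    /** A successful validation ends the failure streak and resets the timer. */
    void onSuccess() {
        timerRunning = false;
    }
}
```

What `GIVE_UP` maps to (rethrow and kill the thread, or send the shutdown error code) is left to the caller, so the same tracker would work with either outcome. The "constant number of rebalances" variant would be the same shape with a counter in place of the timestamp.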
I think if we go with option 3 (the combination), then shutting down the entire application becomes more palatable, as we have given the environment a chance to stabilize.
But killing the thread still seems preferable, given the two new features that are coming out soon: the ability to start up new threads, and the improved exception handler that allows the user to choose to shut down the entire application if that's really what they want. Once users have this level of control over the application, we should allow them to decide how they want to handle exceptional cases like this, rather than forcing an option on them (e.g. shutting down everything).
IMO we should fix this before 2.7 comes out, even if it's just a partial fix (e.g. we do option 1 in 2.7, but plan to implement option 3 eventually).