We should consider adding a new state to the KafkaStreams FSM: RESTORING.
This state would cover the time between the completion of a stable rebalance and the completion of restoration across the client. Currently, Streams reports the state during this time as REBALANCING, even though in most cases it spends much more time restoring than rebalancing.
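To make the proposal concrete, here is a rough sketch of where RESTORING could sit in the state machine. This is not the actual KafkaStreams.State enum: the class name StreamsStateSketch, the reduced set of states, and the allowed transitions are all assumptions for illustration, and the real FSM has additional states (shutdown, error) that are left out here.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the client-level FSM with the proposed
// RESTORING state. Not the real KafkaStreams.State enum.
public class StreamsStateSketch {

    public enum State {
        CREATED, REBALANCING, RESTORING, RUNNING;

        // Assumed transitions; the real set would be decided in the KIP.
        public Set<State> validNext() {
            switch (this) {
                case CREATED:
                    return EnumSet.of(REBALANCING);
                case REBALANCING:
                    // A stable rebalance is followed by restoration, or goes
                    // straight to RUNNING if there is nothing to restore.
                    return EnumSet.of(RESTORING, RUNNING);
                case RESTORING:
                    // A new rebalance can interrupt restoration.
                    return EnumSet.of(REBALANCING, RUNNING);
                case RUNNING:
                    return EnumSet.of(REBALANCING);
                default:
                    return EnumSet.noneOf(State.class);
            }
        }
    }

    private State state = State.CREATED;

    public State state() {
        return state;
    }

    // Moves to the next state, rejecting transitions the sketch does not allow.
    public void transitionTo(State next) {
        if (!state.validNext().contains(next)) {
            throw new IllegalStateException(
                "Invalid transition: " + state + " -> " + next);
        }
        state = next;
    }

    public static void main(String[] args) {
        StreamsStateSketch client = new StreamsStateSketch();
        State[] path = {State.REBALANCING, State.RESTORING, State.RUNNING};
        for (State next : path) {
            client.transitionTo(next);
            System.out.println("state = " + client.state());
        }
    }
}
```

With a split like this, a state listener or metric could distinguish "still rebalancing" from "rebalance finished, restoring state" without any log diving.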
There are a few motivations/benefits behind this idea:
- Observability is a big one: using the umbrella REBALANCING state to cover every phase of rebalancing -> task initialization -> restoring has been a common source of confusion. It has also proved to be a time sink for us during escalations, incidents, mailing list questions, and bug reports. It often adds latency to escalations in particular, as we have to go through GTS and wait for the customer to clarify whether their “Kafka Streams is stuck rebalancing” ticket means that it’s literally rebalancing, or just in the REBALANCING state and actually stuck elsewhere in Streams.
- Prereq for global thread improvements: for example, KIP-406 (GlobalStreamThread should honor custom reset policy) was ultimately blocked on this, as we needed to pause the Streams app while the global thread restored from the appropriate offset. Since there is no rebalancing involved in this case, piggybacking on the REBALANCING state would just be shooting ourselves in the foot.