With fine-grained recovery introduced in 1.9.0, the fullRestart metric only counts how many times the entire graph has been restarted. It does not include restarts triggered by fine-grained failure recovery.
As many users rely on this metric for failure detection, monitoring, and alerting, I'd propose making it count fine-grained restarts as well.
The concrete proposal is:
- Add a counter numberOfRestartsCounter in ExecutionGraph to count all restarts. The counter is not to be registered to metric groups.
- Let fullRestart query the value of the counter, instead of ExecutionGraph#globalModVersion
- Increment numberOfRestartsCounter in ExecutionGraph#incrementGlobalModVersion()
- Increment numberOfRestartsCounter in AdaptedRestartPipelinedRegionStrategyNG#restartTasks(...), so that a restart is counted only when fine-grained recovery actually takes place
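
To illustrate the proposal, here is a minimal, self-contained sketch. It does not use the real Flink classes: SketchExecutionGraph, SketchRegionStrategy, incrementRestarts() and fullRestartGauge(...) are hypothetical stand-ins, and the use of AtomicLong and LongSupplier is an assumption made only to keep the example runnable without Flink on the classpath.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Simplified model of the proposal; NOT the actual Flink implementation. */
public class RestartCountSketch {

    /** Hypothetical stand-in for ExecutionGraph. */
    public static class SketchExecutionGraph {
        private long globalModVersion = 0;

        // Plain internal counter; deliberately not registered to any metric group.
        private final AtomicLong numberOfRestartsCounter = new AtomicLong();

        /** Called on a global (full graph) restart. */
        public long incrementGlobalModVersion() {
            numberOfRestartsCounter.incrementAndGet();
            return ++globalModVersion;
        }

        /** Hypothetical hook for restarts that do not bump globalModVersion. */
        public void incrementRestarts() {
            numberOfRestartsCounter.incrementAndGet();
        }

        public long getNumberOfRestarts() {
            return numberOfRestartsCounter.get();
        }

        public long getGlobalModVersion() {
            return globalModVersion;
        }
    }

    /** Hypothetical stand-in for AdaptedRestartPipelinedRegionStrategyNG. */
    public static class SketchRegionStrategy {
        private final SketchExecutionGraph graph;

        public SketchRegionStrategy(SketchExecutionGraph graph) {
            this.graph = graph;
        }

        /** Counts the restart at the point where the region's tasks are restarted. */
        public void restartTasks() {
            graph.incrementRestarts();
            // ... the actual task restart logic would go here ...
        }
    }

    /** The fullRestart value reads the counter instead of globalModVersion. */
    public static LongSupplier fullRestartGauge(SketchExecutionGraph graph) {
        return graph::getNumberOfRestarts;
    }

    public static void main(String[] args) {
        SketchExecutionGraph graph = new SketchExecutionGraph();
        SketchRegionStrategy strategy = new SketchRegionStrategy(graph);
        LongSupplier fullRestart = fullRestartGauge(graph);

        graph.incrementGlobalModVersion(); // one global restart
        strategy.restartTasks();           // two fine-grained restarts
        strategy.restartTasks();

        System.out.println("fullRestart = " + fullRestart.getAsLong());
        System.out.println("globalModVersion = " + graph.getGlobalModVersion());
    }
}
```

In this sketch, one global restart followed by two fine-grained restarts yields a fullRestart value of 3 while globalModVersion stays at 1, which is the difference the proposal is meant to surface.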