Two status may not be correct with region failover and current reset logic.
- numberOfRunningProducers in IntermediateResult.
- hasDataProduced in IntermediateResultPartition.
This is because currently only when the ExecutionJobVertex is reset will the related IntermediateResult(and the inner IntermediateResultPartition) get reset. But region failover only resets the affected ExecutionVertex(es), rather than the entire ExecutionJobVertex, leaving the status listed above in an inconsistent state.
Problems below may occur as a result:
- when a FINISHED vertex is restarted and finishes again, the IntermediateResult.numberOfRunningProducers may drop below 0 and throws exception to trigger global failover
- the IntermediateResult.numberOfRunningProducers can be smaller than fact, letting the downstream vertices scheduled earlier than expected
- the IntermediateResultPartition is reset and not started yet but the hasDataProduced remains true
That's why I'd propose we add IntermediateResult status adjust logic to ExecutionVertex.resetForNewExecution()**.