A job can hang when a FINISHED task with POINTWISE BLOCKING consumers is restarted. This is because PipelinedRegionSchedulingStrategy#onExecutionStateChange() tries to schedule all consumer tasks/regions of the finished ExecutionJobVertex, including regions that are not actual consumers of the finished ExecutionVertex. Some of these regions can be in a non-CREATED state because they are neither connected to nor affected by the restarted tasks. However, PipelinedRegionSchedulingStrategy#maybeScheduleRegion() does not allow scheduling a non-CREATED region: it throws an exception, which breaks the scheduling of all the other regions. An example of this problem can be found in PipelinedRegionSchedulingITCase#testRecoverFromPartitionException.
To fix the problem, we can add a filter in PipelinedRegionSchedulingStrategy#onExecutionStateChange() so that it only triggers the scheduling of regions in the CREATED state.
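
A minimal sketch of the proposed filter, under simplifying assumptions. It does not use Flink's real classes: RegionFilterSketch, Region, State, scheduleAll() and scheduleCreatedOnly() are hypothetical names chosen for illustration, and maybeScheduleRegion() here only mimics the behavior described above (rejecting non-CREATED regions). The actual change in PipelinedRegionSchedulingStrategy may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types are Flink's actual classes.
class RegionFilterSketch {

    enum State { CREATED, SCHEDULED, RUNNING, FINISHED }

    static final class Region {
        final String name;
        State state;

        Region(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Mimics the described behavior: scheduling a non-CREATED region is an error.
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != State.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state);
        }
        region.state = State.SCHEDULED;
        scheduled.add(region.name);
    }

    // Behavior before the fix: every consumer region is passed on, so the first
    // non-CREATED region throws and the remaining regions are never scheduled.
    static List<String> scheduleAll(List<Region> consumerRegions) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumerRegions) {
            maybeScheduleRegion(region, scheduled);
        }
        return scheduled;
    }

    // Proposed fix: skip regions that are not in the CREATED state.
    static List<String> scheduleCreatedOnly(List<Region> consumerRegions) {
        List<String> scheduled = new ArrayList<>();
        consumerRegions.stream()
                .filter(region -> region.state == State.CREATED)
                .forEach(region -> maybeScheduleRegion(region, scheduled));
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
                new Region("restarted-consumer-1", State.CREATED),
                new Region("unaffected", State.RUNNING),
                new Region("restarted-consumer-2", State.CREATED));
        System.out.println(scheduleCreatedOnly(regions));
    }
}
```

In this sketch, scheduleAll() would fail on the "unaffected" region before reaching "restarted-consumer-2", which corresponds to the hang described above, while scheduleCreatedOnly() schedules both restarted regions and leaves the running one alone.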