Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Versions: 1.12.2, 1.13.0
Description
A job can hang when a FINISHED task with POINTWISE BLOCKING consumers is restarted. The cause is that PipelinedRegionSchedulingStrategy#onExecutionStateChange() tries to schedule all consumer tasks/regions of the finished ExecutionJobVertex, including regions that are not actual consumers of the finished ExecutionVertex. Some of those regions can be in a non-CREATED state, because they are neither connected to nor affected by the restarted tasks. However, PipelinedRegionSchedulingStrategy#maybeScheduleRegion() does not allow scheduling a non-CREATED region: it throws an exception, which aborts the scheduling of all the other regions. An example of this case can be found in PipelinedRegionSchedulingITCase#testRecoverFromPartitionException.
To fix the problem, we can add a filter in PipelinedRegionSchedulingStrategy#onExecutionStateChange() to only trigger the scheduling of regions in CREATED state.
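The proposed filter can be illustrated with a small, self-contained model. This is only a sketch and not the actual Flink code: the class RegionSchedulingSketch, its State enum, the Region type, and the methods scheduleWithoutFilter/scheduleWithFilter are all hypothetical names made up for illustration. Only the behaviour of maybeScheduleRegion() (rejecting non-CREATED regions) and the idea of filtering by CREATED state are taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the scheduling logic described in this issue.
// None of these types are Flink classes.
class RegionSchedulingSketch {

    enum State { CREATED, SCHEDULED, RUNNING, FINISHED }

    static final class Region {
        final String name;
        State state;

        Region(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Names of the regions that were actually scheduled, in order.
    final List<String> scheduled = new ArrayList<>();

    // Models the precondition from the description: a non-CREATED region is rejected.
    void maybeScheduleRegion(Region region) {
        if (region.state != State.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state + ", expected CREATED");
        }
        region.state = State.SCHEDULED;
        scheduled.add(region.name);
    }

    // Problematic behaviour: every candidate is passed on, so the first
    // non-CREATED region aborts the scheduling of all remaining regions.
    void scheduleWithoutFilter(List<Region> candidates) {
        candidates.forEach(this::maybeScheduleRegion);
    }

    // Proposed behaviour: only regions still in CREATED state are passed on.
    void scheduleWithFilter(List<Region> candidates) {
        candidates.stream()
                .filter(region -> region.state == State.CREATED)
                .forEach(this::maybeScheduleRegion);
    }

    public static void main(String[] args) {
        RegionSchedulingSketch strategy = new RegionSchedulingSketch();
        List<Region> candidates = Arrays.asList(
                new Region("unaffected", State.RUNNING),  // not touched by the restart
                new Region("restarted", State.CREATED));  // real consumer of the restarted task
        strategy.scheduleWithFilter(candidates);
        System.out.println(strategy.scheduled);
    }
}
```

In this model, scheduleWithoutFilter fails on the RUNNING region before it ever reaches the CREATED one, which corresponds to the hang described above; scheduleWithFilter skips the unaffected region and schedules only the restarted one.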
Issue Links
- relates to
  - FLINK-21734 Allow BLOCKING result partition to be individually consumable (Closed)
  - FLINK-21735 Harden JobMaster#updateTaskExecutionState() (Closed)