Flink / FLINK-21707

Job may hang when restarting a FINISHED task with POINTWISE BLOCKING consumers


Details

    Description

      A job may hang when a FINISHED task with POINTWISE BLOCKING consumers is restarted. The cause is that PipelinedRegionSchedulingStrategy#onExecutionStateChange() tries to schedule all consumer tasks/regions of the finished ExecutionJobVertex, including regions that do not actually consume from the finished ExecutionVertex. Some of those regions can be in a non-CREATED state because they are neither connected to nor affected by the restarted tasks. However, PipelinedRegionSchedulingStrategy#maybeScheduleRegion() does not allow scheduling a non-CREATED region: it throws an exception, which aborts the scheduling of all remaining regions. An example that reproduces the problem is PipelinedRegionSchedulingITCase#testRecoverFromPartitionException.

      To fix the problem, we can add a filter in PipelinedRegionSchedulingStrategy#onExecutionStateChange() to only trigger the scheduling of regions in CREATED state.
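
The failure mode and the proposed filter can be illustrated with a minimal, self-contained sketch. It does not use Flink's real classes: `Region`, `State`, `scheduleAll` and `scheduleCreatedOnly` are hypothetical stand-ins, and `maybeScheduleRegion` here only models the behaviour described above (rejecting non-CREATED regions with an exception), not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the scheduling loop; all types here are hypothetical,
// not Flink's actual PipelinedRegionSchedulingStrategy internals.
class RegionFilterSketch {

    enum State { CREATED, RUNNING, FINISHED }

    static final class Region {
        final String name;
        State state;

        Region(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Models the described behaviour: a non-CREATED region is rejected with an exception.
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != State.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state);
        }
        region.state = State.RUNNING;
        scheduled.add(region.name);
    }

    // Without a filter: the first non-CREATED region aborts the whole loop,
    // so later CREATED regions are never scheduled.
    static List<String> scheduleAll(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumers) {
            maybeScheduleRegion(region, scheduled);
        }
        return scheduled;
    }

    // With the proposed filter: only CREATED regions are passed on.
    static List<String> scheduleCreatedOnly(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumers) {
            if (region.state == State.CREATED) {
                maybeScheduleRegion(region, scheduled);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> consumers = new ArrayList<>();
        consumers.add(new Region("r1", State.RUNNING)); // unaffected by the restart
        consumers.add(new Region("r2", State.CREATED)); // real consumer of the restarted task

        try {
            scheduleAll(consumers);
        } catch (IllegalStateException e) {
            System.out.println("unfiltered scheduling failed: " + e.getMessage());
        }
        System.out.println("filtered scheduling: " + scheduleCreatedOnly(consumers));
    }
}
```

In the unfiltered run, `r2` is never reached because the exception for `r1` ends the loop, which corresponds to the hang described in this issue; the filtered run skips `r1` and schedules `r2`.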

            People

              zhuzh Zhu Zhu