FLINK-21707

Job can hang when restarting a FINISHED task with POINTWISE BLOCKING consumers


      Description

      A job can hang when restarting a FINISHED task that has POINTWISE BLOCKING consumers. This is because PipelinedRegionSchedulingStrategy#onExecutionStateChange() tries to schedule all consumer tasks/regions of the finished ExecutionJobVertex, including regions that are not actual consumers of the finished ExecutionVertex. Some of these regions can be in a non-CREATED state because they are neither connected to nor affected by the restarted tasks. However, PipelinedRegionSchedulingStrategy#maybeScheduleRegion() does not allow scheduling a non-CREATED region: it throws an exception, which aborts the scheduling of all the other regions. An example of this case can be found in PipelinedRegionSchedulingITCase#testRecoverFromPartitionException.

      To fix the problem, we can add a filter in PipelinedRegionSchedulingStrategy#onExecutionStateChange() so that it only triggers the scheduling of regions in the CREATED state.
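
      The proposed filter can be illustrated with a minimal, self-contained sketch. This is not Flink's actual code: RegionFilterSketch, Region, RegionState, scheduleAll and scheduleCreatedOnly are hypothetical stand-ins, and maybeScheduleRegion here only mimics the described behavior of rejecting a non-CREATED region.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the problem; all types here are hypothetical, not Flink classes.
final class RegionFilterSketch {

    // Assumed, reduced set of region states for illustration.
    enum RegionState { CREATED, SCHEDULED, FINISHED }

    // Hypothetical stand-in for a pipelined scheduling region.
    static final class Region {
        final String name;
        RegionState state;

        Region(String name, RegionState state) {
            this.name = name;
            this.state = state;
        }
    }

    // Mimics the described behavior: scheduling a non-CREATED region throws.
    static void maybeScheduleRegion(Region region) {
        if (region.state != RegionState.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is not in CREATED state: " + region.state);
        }
        region.state = RegionState.SCHEDULED;
    }

    // Unfiltered variant: the first non-CREATED region aborts the whole loop,
    // so later CREATED regions are never scheduled.
    static void scheduleAll(List<Region> consumerRegions) {
        for (Region region : consumerRegions) {
            maybeScheduleRegion(region);
        }
    }

    // Filtered variant: only CREATED regions are passed on for scheduling.
    // Returns the names of the regions that were scheduled.
    static List<String> scheduleCreatedOnly(List<Region> consumerRegions) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumerRegions) {
            if (region.state == RegionState.CREATED) {
                maybeScheduleRegion(region);
                scheduled.add(region.name);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
                new Region("unaffected", RegionState.FINISHED),
                new Region("restarted", RegionState.CREATED));
        System.out.println("Scheduled: " + scheduleCreatedOnly(regions));
    }
}
```

      In this model, the unfiltered loop fails on the unaffected region before it reaches the restarted one, which corresponds to the hang described above; the filtered loop skips the unaffected region and schedules only the restarted one.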

      People

      Assignee: Zhu Zhu (zhuzh)
      Reporter: Zhu Zhu (zhuzh)
