Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5934

Scheduler in ExecutionGraph null if failure happens in ExecutionGraph.restoreLatestCheckpointedState

    Details

      Description

      If ExecutionGraph.restoreLatestCheckpointedState fails with an exception, then all subsequent recoveries will fail because the scheduler has not been set in the ExecutionGraph.

      I propose to set the scheduler when the ExecutionGraph is created to avoid this problem.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3437

          FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphScheduler

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3437.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3437


          commit 5858c8dc79546ccde68b99993a2695216452bacc
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-28T16:27:03Z

          FLINK-5938 Replace ExecutionContext by Executor in Scheduler

          In order to remove the Scheduler's dependency on Scala's ExecutionContext and
          Akka's futures, this PR replaces the ExecutionContext by an Executor which is
          used to execute the concurrent handleNewSlot call.

          commit cb0301e1c72dd15def6bef515ea754152be47145
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-28T14:20:47Z

          FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3437 FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphScheduler Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3437.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3437 commit 5858c8dc79546ccde68b99993a2695216452bacc Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-28T16:27:03Z FLINK-5938 Replace ExecutionContext by Executor in Scheduler In order to remove the Scheduler's dependency on Scala's ExecutionContext and Akka's futures, this PR replaces the ExecutionContext by an Executor which is used to execute the concurrent handleNewSlot call. commit cb0301e1c72dd15def6bef515ea754152be47145 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-28T14:20:47Z FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3440

          [backport-1.2] FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          This is a backport of #3437.

          This PR is based on #3439.

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphSchedulerBp1.2

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3440.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3440


          commit 045e2187f6d43679115909b10aaeb78c627b758c
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-28T16:27:03Z

          FLINK-5938 Replace ExecutionContext by Executor in Scheduler

          In order to remove the Scheduler's dependency on Scala's ExecutionContext and
          Akka's futures, this PR replaces the ExecutionContext by an Executor which is
          used to execute the concurrent handleNewSlot call.

          commit 2525fb2a0e08418f09319568f58ed7daa9284376
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-28T14:20:47Z

          FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3440 [backport-1.2] FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor This is a backport of #3437. This PR is based on #3439. Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphSchedulerBp1.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3440.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3440 commit 045e2187f6d43679115909b10aaeb78c627b758c Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-28T16:27:03Z FLINK-5938 Replace ExecutionContext by Executor in Scheduler In order to remove the Scheduler's dependency on Scala's ExecutionContext and Akka's futures, this PR replaces the ExecutionContext by an Executor which is used to execute the concurrent handleNewSlot call. commit 2525fb2a0e08418f09319568f58ed7daa9284376 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-28T14:20:47Z FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3441

          [backport-1.1] FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          This is a backport of #3437 onto `release-1.1`.

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphSchedulerBp1.1

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3441.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3441


          commit 462ba8ce0029b262bd6755000037c932920bac32
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-28T14:20:47Z

          FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor

          Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This
          has the disadvantage that the ExecutionGraph has not scheduler set if something else
          went wrong before the scheduleForExecution call. Consequently, the job will be stuck
          in a restart loop because the recovery will fail if there is no Scheduler set. In
          order to solve the problem, the Scheduler is not passed to the ExecutionGraph when
          it is created.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3441 [backport-1.1] FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor This is a backport of #3437 onto `release-1.1`. Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixExecutionGraphSchedulerBp1.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3441.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3441 commit 462ba8ce0029b262bd6755000037c932920bac32 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-28T14:20:47Z FLINK-5934 Set the Scheduler in the ExecutionGraph via its constructor Before the scheduler was set when calling ExecutionGraph.scheduleForExecution(). This has the disadvantage that the ExecutionGraph has not scheduler set if something else went wrong before the scheduleForExecution call. Consequently, the job will be stuck in a restart loop because the recovery will fail if there is no Scheduler set. In order to solve the problem, the Scheduler is not passed to the ExecutionGraph when it is created.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3441

          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3441 +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3440

          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3440 +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3437

          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3437 +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3440#discussion_r103708910

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java —
          @@ -184,31 +202,14 @@

          // ------ Fields that are relevant to the execution and need to be cleared before archiving -------

          • /** The slot provider to use for allocating slots for tasks as they are needed */
          • private SlotProvider slotProvider;
            -
          • /** Strategy to use for restarts */
          • private RestartStrategy restartStrategy;
            -
          • /** The classloader for the user code. Needed for calls into user code classes */
          • private ClassLoader userClassLoader;
            -
            /** The coordinator for checkpoints, if snapshot checkpoints are enabled */
            private CheckpointCoordinator checkpointCoordinator;

          /** Checkpoint stats tracker separate from the coordinator in order to be

          • available after archiving. */
            + @SuppressWarnings("NonSerializableFieldInSerializableClass")
            private CheckpointStatsTracker checkpointStatsTracker;
              • End diff –

          Shouldn't these be `transient` instead?

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/3440#discussion_r103708910 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java — @@ -184,31 +202,14 @@ // ------ Fields that are relevant to the execution and need to be cleared before archiving ------- /** The slot provider to use for allocating slots for tasks as they are needed */ private SlotProvider slotProvider; - /** Strategy to use for restarts */ private RestartStrategy restartStrategy; - /** The classloader for the user code. Needed for calls into user code classes */ private ClassLoader userClassLoader; - /** The coordinator for checkpoints, if snapshot checkpoints are enabled */ private CheckpointCoordinator checkpointCoordinator; /** Checkpoint stats tracker separate from the coordinator in order to be available after archiving. */ + @SuppressWarnings("NonSerializableFieldInSerializableClass") private CheckpointStatsTracker checkpointStatsTracker; End diff – Shouldn't these be `transient` instead?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3437

          Thanks for the review @StephanEwen. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3437 Thanks for the review @StephanEwen. Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3437

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3437
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3440#discussion_r103903694

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java —
          @@ -184,31 +202,14 @@

          // ------ Fields that are relevant to the execution and need to be cleared before archiving -------

          • /** The slot provider to use for allocating slots for tasks as they are needed */
          • private SlotProvider slotProvider;
            -
          • /** Strategy to use for restarts */
          • private RestartStrategy restartStrategy;
            -
          • /** The classloader for the user code. Needed for calls into user code classes */
          • private ClassLoader userClassLoader;
            -
            /** The coordinator for checkpoints, if snapshot checkpoints are enabled */
            private CheckpointCoordinator checkpointCoordinator;

          /** Checkpoint stats tracker separate from the coordinator in order to be

          • available after archiving. */
            + @SuppressWarnings("NonSerializableFieldInSerializableClass")
            private CheckpointStatsTracker checkpointStatsTracker;
              • End diff –

          This annotation is useless since `ExecutionGraph` is no longer `Serializable`. I think this is an artifact from froward porting the changes to `1.2`. Will remove it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/3440#discussion_r103903694 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java — @@ -184,31 +202,14 @@ // ------ Fields that are relevant to the execution and need to be cleared before archiving ------- /** The slot provider to use for allocating slots for tasks as they are needed */ private SlotProvider slotProvider; - /** Strategy to use for restarts */ private RestartStrategy restartStrategy; - /** The classloader for the user code. Needed for calls into user code classes */ private ClassLoader userClassLoader; - /** The coordinator for checkpoints, if snapshot checkpoints are enabled */ private CheckpointCoordinator checkpointCoordinator; /** Checkpoint stats tracker separate from the coordinator in order to be available after archiving. */ + @SuppressWarnings("NonSerializableFieldInSerializableClass") private CheckpointStatsTracker checkpointStatsTracker; End diff – This annotation is useless since `ExecutionGraph` is no longer `Serializable`. I think this is an artifact from froward porting the changes to `1.2`. Will remove it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3440

          Thanks for the review @StephanEwen and @uce. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3440 Thanks for the review @StephanEwen and @uce. Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/3440

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/3440
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3441

          Thanks for the review @StephanEwen. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3441 Thanks for the review @StephanEwen. Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/3441

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/3441
          Hide
          till.rohrmann Till Rohrmann added a comment -

          1.3.0: 5c968563df26642f81ca94df391f31c51b4f37e6
          1.2.1: c22efce098c14e8f08bad1e0065dbd02df6e4dbb
          1.1.5: ba5aa10b9bd20f6137134fe8ef8c882ce9c40a7c

          Show
          till.rohrmann Till Rohrmann added a comment - 1.3.0: 5c968563df26642f81ca94df391f31c51b4f37e6 1.2.1: c22efce098c14e8f08bad1e0065dbd02df6e4dbb 1.1.5: ba5aa10b9bd20f6137134fe8ef8c882ce9c40a7c

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development