Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4933

ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.1.3
    • Fix Version/s: 1.2.0, 1.1.4
    • Labels:
      None

      Description

      Currently the ExecutionGraph.scheduleOrUpdateConsumers can fail the whole ExecutionGraph if it cannot find the corresponding Execution. This situation can occur in the restarting scenario where we have a late callback trying to update its consumers. In this case, the call should forward the exception back to the caller and not fail the ExecutionGraph.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2700

          FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

          Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call
          will be reported back to the caller. The caller can then decide what to do. Per default,
          it will fail the calling task.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixScheduleOrUpdateConsumers

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2700.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2700


          commit e51f0a56762c7f12acc215a53ffce2af28d38583
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-27T09:41:29Z

          FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

          Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call
          will be reported back to the caller. The caller can then decide what to do. Per default,
          it will fail the calling task.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2700 FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink fixScheduleOrUpdateConsumers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2700.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2700 commit e51f0a56762c7f12acc215a53ffce2af28d38583 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-27T09:41:29Z FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2701

          [backport] FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

          This is a backport for the release-1.1 branch. The only thing adapted is the added test case.

          Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call
          will be reported back to the caller. The caller can then decide what to do. Per default,
          it will fail the calling task.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportFixScheduleOrUpdateConsumers

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2701.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2701


          commit b17fada0e60ad9986680c89effa628944973d999
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-27T09:41:29Z

          FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call

          Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call
          will be reported back to the caller. The caller can then decide what to do. Per default,
          it will fail the calling task.

          Adapt TaskManagerTest


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2701 [backport] FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call This is a backport for the release-1.1 branch. The only thing adapted is the added test case. Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink backportFixScheduleOrUpdateConsumers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2701.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2701 commit b17fada0e60ad9986680c89effa628944973d999 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-27T09:41:29Z FLINK-4933 [exec graph] Don't let the EG fail in case of a failing scheduleOrUpdateConsumers call Instead of failing the complete ExecutionGraph, a failing scheduleOrUpdateConsumers call will be reported back to the caller. The caller can then decide what to do. Per default, it will fail the calling task. Adapt TaskManagerTest
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2700#discussion_r85533357

          — Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala —
          @@ -917,8 +917,15 @@ class JobManager(
          case ScheduleOrUpdateConsumers(jobId, partitionId) =>
          currentJobs.get(jobId) match {
          case Some((executionGraph, _)) =>

          • sender ! decorateMessage(Acknowledge)
          • executionGraph.scheduleOrUpdateConsumers(partitionId)
            + try { + executionGraph.scheduleOrUpdateConsumers(partitionId) + sender ! decorateMessage(Acknowledge) + }

            catch {
            + case e: ExecutionGraphException =>

              • End diff –

          Does it make sense to catch the more generic `Exception` type here in order to make the sender notice any problems sooner? I see that the method only throws EGExceptions currently but maybe at some point in time someone introduces a runtime exception etc. This would only be logged at the JM and the task's ask would timeout.

          Show
          githubbot ASF GitHub Bot added a comment - Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/2700#discussion_r85533357 — Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala — @@ -917,8 +917,15 @@ class JobManager( case ScheduleOrUpdateConsumers(jobId, partitionId) => currentJobs.get(jobId) match { case Some((executionGraph, _)) => sender ! decorateMessage(Acknowledge) executionGraph.scheduleOrUpdateConsumers(partitionId) + try { + executionGraph.scheduleOrUpdateConsumers(partitionId) + sender ! decorateMessage(Acknowledge) + } catch { + case e: ExecutionGraphException => End diff – Does it make sense to catch the more generic `Exception` type here in order to make the sender notice any problems sooner? I see that the method only throws EGExceptions currently but maybe at some point in time someone introduces a runtime exception etc. This would only be logged at the JM and the task's ask would timeout.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2700

          Thanks for the review @uce. This is a problem which is relevant for the current master. It could make the `ExecutionGraph` go into state `FAILED` when being in state `RESTARTING` in case that the `scheduleOrUpdateConsumers` call failed.

          Will rebase the PR because it was based on the failing `SpanningRecordSerializerTest`.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2700 Thanks for the review @uce. This is a problem which is relevant for the current master. It could make the `ExecutionGraph` go into state `FAILED` when being in state `RESTARTING` in case that the `scheduleOrUpdateConsumers` call failed. Will rebase the PR because it was based on the failing `SpanningRecordSerializerTest`.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2701

          Will rebase, because this PR is based on the release-1.1 branch which contained the failing `SpanningRecordSerializerTest`.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2701 Will rebase, because this PR is based on the release-1.1 branch which contained the failing `SpanningRecordSerializerTest`.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2700

          Looks good to me, +1

          Merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2700 Looks good to me, +1 Merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2701

          Looks good, +1

          Merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2701 Looks good, +1 Merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2700

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2700
          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed in

          • 1.2.0 via 649f9578bce56e118ec973218271262286c534a9
          • 1.1.4 via d941b50db4c58c7ea3d5c1d888d34ea8975407e0
          Show
          StephanEwen Stephan Ewen added a comment - Fixed in 1.2.0 via 649f9578bce56e118ec973218271262286c534a9 1.1.4 via d941b50db4c58c7ea3d5c1d888d34ea8975407e0
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2701

          Has been merged to the release branch 1.1.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2701 Has been merged to the release branch 1.1.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2701

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/2701

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              till.rohrmann Till Rohrmann
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development