
Don't let ExecutionGraph fail when in state Restarting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.1.4
    • Fix Version/s: 1.2.0, 1.1.4
    • Labels:
      None

      Description

      While the ExecutionGraph is in state RESTARTING, it is still possible to fail it. This should not be possible, because no action that can fail the ExecutionGraph should be performed in that state.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2710

          FLINK-4932 [exec graph] Failing in state RESTARTING only fails the EG if no more restarts are possible

          If in state RESTARTING a failure occurs (`ExecutionGraph.fail` is called), then a new restart attempt is started. Only if the restart strategy no longer allows further restarts or if the thrown exception is of type `SuppressRestartsException` a job can go from RESTARTING into FAILED.

          @StephanEwen would be great if you could review this PR.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixRestartingState

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2710.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2710


          commit 9d96a799ee4e0f6b77cf5ae97d270021242a5462
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-27T16:32:08Z

          FLINK-4932 [exec graph] Failing in state RESTARTING only fails the EG if no more restarts are possible

          If in state RESTARTING a failure occurs, then a new restart attempt is started. Only if the
          restart strategy no longer allows further restarts or if the thrown exception is of type
          SuppressRestartsException a job can go from RESTARTING into FAILED.
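
The transition rule described in this pull request can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code from the patch: `RestartingStateSketch`, `RestartBudget`, and `onFailureWhileRestarting` are hypothetical names, and the `SuppressRestartsException` below is a local stand-in for the exception type named in the description.

```java
final class RestartingStateSketch {

    /** Simplified stand-in for the two job states discussed in this issue. */
    enum JobState { RESTARTING, FAILED }

    /** Local stand-in for the exception type named in the pull request. */
    static final class SuppressRestartsException extends RuntimeException {
        SuppressRestartsException(Throwable cause) {
            super(cause);
        }
    }

    /** Hypothetical restart strategy: a fixed budget of attempts, negative means unlimited. */
    static final class RestartBudget {
        private final int maxAttempts;
        private int usedAttempts;

        RestartBudget(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        boolean canRestart() {
            return maxAttempts < 0 || usedAttempts < maxAttempts;
        }

        void recordAttempt() {
            usedAttempts++;
        }
    }

    /** Decides the next state when a failure is reported while the job is RESTARTING. */
    static JobState onFailureWhileRestarting(Throwable cause, RestartBudget budget) {
        boolean suppressed = cause instanceof SuppressRestartsException;
        if (!suppressed && budget.canRestart()) {
            budget.recordAttempt();      // start a new restart attempt
            return JobState.RESTARTING;  // the job stays in RESTARTING
        }
        // No restarts left, or restarts were explicitly suppressed.
        return JobState.FAILED;
    }

    public static void main(String[] args) {
        RestartBudget budget = new RestartBudget(1);
        System.out.println(onFailureWhileRestarting(new RuntimeException("late RPC"), budget));
        System.out.println(onFailureWhileRestarting(new RuntimeException("late RPC"), budget));

        RestartBudget unlimited = new RestartBudget(-1);
        System.out.println(onFailureWhileRestarting(
                new SuppressRestartsException(new RuntimeException("fatal")), unlimited));
    }
}
```

In this model, the first failure consumes the single remaining attempt and the job stays in RESTARTING, the second failure finds the budget exhausted and moves the job to FAILED, and a suppressed failure moves it to FAILED even with an unlimited budget.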


          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2711

          [backport] FLINK-4932 [exec graph] Failing in state RESTARTING only fails the EG if no more restarts are possible

          This PR is a backport of #2710 for the release-1.1 branch.

          If in state RESTARTING a failure occurs (`ExecutionGraph.fail` is called), then a new restart attempt is started. Only if the restart strategy no longer allows further restarts or if the thrown exception is of type SuppressRestartsException a job can go from RESTARTING into FAILED.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportFixRestartingState

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2711.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2711


          commit 920dbe6ac5969c63d9725f0999c3cd1adfa70fad
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-27T16:32:08Z

          FLINK-4932 [exec graph] Failing in state RESTARTING only fails the EG if no more restarts are possible

          If in state RESTARTING a failure occurs, then a new restart attempt is started. Only if the
          restart strategy no longer allows further restarts or if the thrown exception is of type
          SuppressRestartsException a job can go from RESTARTING into FAILED.


          githubbot ASF GitHub Bot added a comment -

          GitHub user uce commented on the issue:

          https://github.com/apache/flink/pull/2711

          Code changes and updated figure look good to me. +1 to merge after Travis passes. It makes sense that, for instance, a job with infinite restarts will only be failed if it is explicitly suspended or forced to suppress restarts. I was curious how you noticed this issue. Did a user run into this?

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2711

          Thanks for the review @uce. @StephanEwen actually discovered this problem. The problem was related to #2700. When a late RPC triggered the `scheduleOrUpdateConsumers` call while the job was in state RESTARTING, it would fail the complete `ExecutionGraph`, because it could not find the respective `Execution`.

          Will rebase the PR, because release-1.1 contained a problem which prevented it from building.
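
The scenario in this comment can be modelled in a few lines. The sketch below is a toy model, not Flink code: `LateRpcSketch` and its methods are hypothetical, states are reduced to three values, and it assumes that entering RESTARTING forgets the execution attempts of the previous run, which is why a late message cannot find its attempt. It shows the behaviour after the fix, where such a failure starts another restart attempt instead of failing the job.

```java
import java.util.HashSet;
import java.util.Set;

final class LateRpcSketch {

    enum JobState { RUNNING, RESTARTING, FAILED }

    private final Set<String> knownAttempts = new HashSet<>();
    private final boolean restartsAllowed;
    private JobState state = JobState.RUNNING;
    private int restartAttempts;

    LateRpcSketch(boolean restartsAllowed) {
        this.restartsAllowed = restartsAllowed;
    }

    void deploy(String attemptId) {
        knownAttempts.add(attemptId);
    }

    /** Assumption: entering RESTARTING forgets the attempts of the previous run. */
    private void beginRestart() {
        knownAttempts.clear();
        state = JobState.RESTARTING;
        restartAttempts++;
    }

    /** Hypothetical handler: a message for an unknown attempt is reported as a failure. */
    void scheduleOrUpdateConsumers(String attemptId) {
        if (!knownAttempts.contains(attemptId)) {
            fail(new IllegalStateException("Cannot find execution attempt " + attemptId));
        }
    }

    /** A failure leads to FAILED only when no further restart is allowed. */
    void fail(Throwable cause) {
        if (state == JobState.FAILED) {
            return; // already terminal
        }
        if (restartsAllowed) {
            beginRestart();
        } else {
            state = JobState.FAILED;
        }
    }

    JobState state() {
        return state;
    }

    int restartAttempts() {
        return restartAttempts;
    }

    public static void main(String[] args) {
        LateRpcSketch graph = new LateRpcSketch(true);
        graph.deploy("attempt-1");
        graph.fail(new RuntimeException("task failure"));  // job enters RESTARTING
        graph.scheduleOrUpdateConsumers("attempt-1");      // late message for a forgotten attempt
        System.out.println(graph.state() + " after " + graph.restartAttempts() + " restart attempts");
    }
}
```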

          githubbot ASF GitHub Bot added a comment -

          GitHub user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2710

          Good fix! Merging this...

          githubbot ASF GitHub Bot added a comment -

          GitHub user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2711

          Thanks for the fix!
          Looks good, merging this...

          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2710

          StephanEwen Stephan Ewen added a comment -

          Fixed in

          • 1.2.0 via 18507de3c068795c93c5c689388a22857f2f817c
          • 1.1.4 via ac82e3d05e895f74ecf41da489068a2997415d3d
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2711

          Closing this because it has been merged into the release-1.1 branch.

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2711


            People

            • Assignee: Till Rohrmann
            • Reporter: Till Rohrmann
            • Votes: 0
            • Watchers: 4
