FLINK-5197

Late JobStatusChanged messages can interfere with running jobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.1.3
    • Fix Version/s: 1.2.0, 1.1.4
    • Component/s: JobManager
    • Labels:
      None

      Description

      When the JobManager receives a JobStatusChanged message, it looks up the ExecutionGraph for the given JobID. If there is no ExecutionGraph, it sends a RemoveJob message to itself. In the general case this is not problematic, because the RemoveJob message does nothing if there is no ExecutionGraph. However, because the RemoveJob message is sent asynchronously, the job with that JobID can be recovered before the RemoveJob message is processed. In that case, the newly recovered job is removed.

      I propose changing the behaviour so that a JobStatusChanged message for a non-existing ExecutionGraph is simply ignored.
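
      The race described above can be reproduced in a small, self-contained model. This is only a sketch and not Flink's actual JobManager code: the `JobManager` class, the tuple-based messages, the `RecoverJob` message name, and the `ignore_outdated` flag are all assumptions made for illustration. It models the JobManager as a single-threaded actor that processes one mailbox message at a time, which is enough to show how a late JobStatusChanged can remove a recovered job and how ignoring the message avoids that.

```python
from collections import deque


class JobManager:
    """Toy actor: messages are processed one at a time from a FIFO mailbox."""

    def __init__(self, ignore_outdated):
        # ignore_outdated=False models the old behaviour, True the proposed one.
        self.ignore_outdated = ignore_outdated
        self.current_jobs = {}  # job_id -> stand-in for the ExecutionGraph
        self.mailbox = deque()
        self.log = []

    def tell(self, message):
        """Enqueue a (kind, job_id) message; delivery is asynchronous."""
        self.mailbox.append(message)

    def process_one(self):
        kind, job_id = self.mailbox.popleft()
        if kind == "JobStatusChanged":
            if job_id in self.current_jobs:
                return  # normal status handling is omitted in this sketch
            if self.ignore_outdated:
                # Proposed behaviour: log and drop the outdated message.
                self.log.append(f"ignored outdated JobStatusChanged for {job_id}")
            else:
                # Old behaviour: send RemoveJob to self (processed later).
                self.tell(("RemoveJob", job_id))
        elif kind == "RecoverJob":  # hypothetical name for the recovery step
            self.current_jobs[job_id] = "recovered ExecutionGraph"
        elif kind == "RemoveJob":
            # No-op when the job is unknown, destructive when it was recovered.
            self.current_jobs.pop(job_id, None)

    def drain(self):
        while self.mailbox:
            self.process_one()


def run_scenario(ignore_outdated):
    """A late JobStatusChanged arrives just before the job is recovered."""
    jm = JobManager(ignore_outdated)
    jm.tell(("JobStatusChanged", "job-1"))
    jm.tell(("RecoverJob", "job-1"))
    jm.drain()
    return jm


if __name__ == "__main__":
    print("old behaviour keeps job:", "job-1" in run_scenario(False).current_jobs)
    print("new behaviour keeps job:", "job-1" in run_scenario(True).current_jobs)
```

      In the old-behaviour run, the self-sent RemoveJob is queued behind the recovery message, so it is processed after the job has been recovered and removes it. In the proposed-behaviour run, the outdated message is only logged and the recovered job stays registered.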

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2895

          FLINK-5197 [jm] Ignore outdated JobStatusChanged messages

          Outdated JobStatusChanged messages no longer trigger a RemoveJob message but are
          logged and ignored. This has the advantage that an outdated JobStatusChanged message
          cannot interfere with a recovered job, which can have the same job id.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink fixJobStatusChangedMessage

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2895.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2895


          commit 490cef46380178a2296c2f743b9eb91154967463
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-29T15:02:29Z

          FLINK-5197 [jm] Ignore outdated JobStatusChanged messages

          Outdated JobStatusChanged messages no longer trigger a RemoveJob message but are
          logged and ignored. This has the advantage, that an outdated JobStatusChanged message
          cannot interfere with a recovered job which can have the same job id.


          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2896

          FLINK-5197 [jm] Ignore outdated JobStatusChanged messages

          Backport of #2895 for release 1.1 branch.

          Outdated JobStatusChanged messages no longer trigger a RemoveJob message but are
          logged and ignored. This has the advantage that an outdated JobStatusChanged message
          cannot interfere with a recovered job, which can have the same job id.

          Review @uce.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportFixJobStatusChangedMessage

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2896.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2896


          commit 4a2f948224fe628c721adc4fae24199b0296c80f
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-11-29T15:02:29Z

          FLINK-5197 [jm] Ignore outdated JobStatusChanged messages

          Outdated JobStatusChanged messages no longer trigger a RemoveJob message but are
          logged and ignored. This has the advantage, that an outdated JobStatusChanged message
          cannot interfere with a recovered job which can have the same job id.


          githubbot ASF GitHub Bot added a comment -

          Github user uce commented on the issue:

          https://github.com/apache/flink/pull/2896

          Good catch! The `RemoveJob` cannot succeed since it also checks the `currentJobs` that were already checked for `JobStatusChanged`. So in the end, the only case where this actually triggers removal is when it interferes with a recovered job, as you say. 😨

          +1 to merge for 1.1 and #2895 for 1.2.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2896

          Thanks for the review @uce. Failing test cases are unrelated. Merging this PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2895

          Merging this PR. @uce reviewed the backport.

          till.rohrmann Till Rohrmann added a comment -

          1.2.0 fixed via ae0975c16b997fe792790216820a559d01a01894
          1.1.4 fixed via 569a9666fca9d9113d9fc7f0382faf986afb036f

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2896

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2895

          Forgot to include the closing tag in the commit.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2895


            People

            • Assignee: Till Rohrmann
            • Reporter: Till Rohrmann
            • Votes: 0
            • Watchers: 2
