FLINK-5836

Race condition between slot offering and task deployment

      Description

      The Flip-6 code has a race condition when slots are offered to a JobManager that immediately deploys tasks to them. In that case the deploy call can overtake the acknowledgement message for the slot offer. The slots are then not yet marked as active, and the deployment fails.

      I propose to fix this by activating all offered slots before sending the slot offer message to the JobManager. Once we have received the acknowledgement for the offer, we deactivate and free the slots that the JobManager did not accept.
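
      The proposed ordering can be illustrated with a minimal sketch. This is not the actual Flink code: the class `SlotOfferSketch`, the `SlotState` enum, and all method names are hypothetical and only model the bookkeeping described above (activate first, offer, then free whatever was not accepted).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a TaskExecutor's slot bookkeeping (not Flink's API). */
final class SlotOfferSketch {

    enum SlotState { FREE, ALLOCATED, ACTIVE }

    private final Map<Integer, SlotState> slots = new HashMap<>();

    /** Starts with the given number of slots, all allocated for one job but not yet active. */
    SlotOfferSketch(int numberOfSlots) {
        for (int i = 0; i < numberOfSlots; i++) {
            slots.put(i, SlotState.ALLOCATED);
        }
    }

    /** Step 1: mark every allocated slot as active BEFORE the offer is sent. */
    List<Integer> activateSlotsForOffer() {
        List<Integer> offered = new ArrayList<>();
        for (Map.Entry<Integer, SlotState> entry : slots.entrySet()) {
            if (entry.getValue() == SlotState.ALLOCATED) {
                entry.setValue(SlotState.ACTIVE);
                offered.add(entry.getKey());
            }
        }
        return offered;
    }

    /** A deployment only succeeds if the target slot is already active. */
    boolean deployTask(int slot) {
        return slots.get(slot) == SlotState.ACTIVE;
    }

    /** Step 2: once the offer is acknowledged, deactivate and free the slots that were not accepted. */
    void onOfferAcknowledged(List<Integer> offered, Set<Integer> accepted) {
        for (int slot : offered) {
            if (!accepted.contains(slot)) {
                slots.put(slot, SlotState.FREE);
            }
        }
    }

    SlotState stateOf(int slot) {
        return slots.get(slot);
    }

    public static void main(String[] args) {
        SlotOfferSketch taskExecutor = new SlotOfferSketch(2);
        List<Integer> offered = taskExecutor.activateSlotsForOffer();

        // The deploy call arrives before the acknowledgement, but the slot is already active.
        System.out.println("early deploy succeeded: " + taskExecutor.deployTask(0));

        // The acknowledgement arrives later and reports that only slot 0 was accepted.
        taskExecutor.onOfferAcknowledged(offered, Collections.singleton(0));
        System.out.println("slot 0: " + taskExecutor.stateOf(0));
        System.out.println("slot 1: " + taskExecutor.stateOf(1));
    }
}
```

      With this ordering a deploy call that overtakes the acknowledgement still finds an active slot. The cost is that rejected slots are briefly active and have to be released again once the acknowledgement arrives.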

          Activity

          Biao Liu (SleePy) added a comment (edited):

          +1
          I think we have encountered this problem.
          If nobody is working on this, I'll take it.

          Till Rohrmann (till.rohrmann) added a comment:

          Great to hear, Biao Liu.

          Biao Liu (SleePy) added a comment (edited):

          Wenlong Lyu has already been working on it, so I am reassigning the issue to him.

          ASF GitHub Bot (githubbot) added a comment:

          GitHub user wenlong88 opened a pull request:

          https://github.com/apache/flink/pull/3371

          FLINK-5836 Fix race condition between offer slot and submit task

          The solution is the same as the one Till described in JIRA: activate the slots when reserving them on the `TaskExecutor`, before offering them to the `JobManager`.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/wenlong88/flink jira-5836

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3371.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3371


          commit 35042f29e055a7f83b7c4d79e4c72673711dfd78
          Author: wenlong.lwl <wenlong.lwl@alibaba-inc.com>
          Date: 2017-01-06T08:32:08Z

          Fix race condition between offer slot and submit task


          ASF GitHub Bot (githubbot) added a comment:

          GitHub user wenlong88 commented on the issue:

          https://github.com/apache/flink/pull/3371

          Hi @tillrohrmann, the CI is failing because of my changes. Could you help take a look at it?

          ASF GitHub Bot (githubbot) added a comment:

          GitHub user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3371

          Btw: the Travis build is failing because it exceeded the build time limit.

          Till Rohrmann (till.rohrmann) added a comment:

          Fixed via d6aed38b3a15946d383d762030b5f5c1418388de

          ASF GitHub Bot (githubbot) added a comment:

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3371


            People

            • Assignee: Wenlong Lyu (wenlong.lwl)
            • Reporter: Till Rohrmann (till.rohrmann)
            • Votes: 0
            • Watchers: 3
