FLINK-5499

Try to reuse the resource location of prior execution attempt in allocating slot

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: JobManager
    • Labels:
      None

      Description

      Currently, when an execution is scheduled and requests a slot from the SlotPool, the TaskManagerLocation parameter is an empty collection. In a task failover scenario, the new execution attempt may therefore be deployed to a different TaskManager. When RocksDB is used as the state backend, performance is better if the state can be restored from the local machine. We therefore try to reuse the TaskManagerLocation of the prior execution attempt when allocating a slot from the SlotPool. If no TaskManagerLocation is available from prior executions, the behavior is the same as it is today.
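
To make the idea concrete, here is a minimal sketch of a slot pool that honors a preferred location and falls back to any free slot when no preference is given or none can be satisfied. The classes `SlotSketch` and `SlotPoolSketch`, and the use of plain strings as TaskManager identifiers, are hypothetical simplifications for illustration. They are not Flink's actual `SlotPool` or `TaskManagerLocation` API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a slot; not Flink's real slot class.
final class SlotSketch {
    final String taskManager; // identifier of the TaskManager offering this slot

    SlotSketch(String taskManager) {
        this.taskManager = taskManager;
    }
}

// Hypothetical stand-in for a slot pool; not Flink's real SlotPool.
final class SlotPoolSketch {
    private final List<SlotSketch> freeSlots = new ArrayList<>();

    // Register one free slot on the given TaskManager.
    void offer(String taskManager) {
        freeSlots.add(new SlotSketch(taskManager));
    }

    // Prefer a free slot on one of the preferred locations.
    // With an empty preference (or no match), hand out the oldest free slot.
    Optional<SlotSketch> allocate(Collection<String> preferredLocations) {
        for (Iterator<SlotSketch> it = freeSlots.iterator(); it.hasNext();) {
            SlotSketch slot = it.next();
            if (preferredLocations.contains(slot.taskManager)) {
                it.remove();
                return Optional.of(slot);
            }
        }
        if (freeSlots.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(freeSlots.remove(0));
    }
}
```

In this sketch, a restarted attempt would pass the location of its prior attempt as the only preferred location, while a first attempt would pass an empty collection and get the unchanged behavior.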

          Activity

          StephanEwen Stephan Ewen added a comment -

          Implemented in
          2e107b1cfaa6e31fe478191c74aa25d53ab49943
          and
          b9ed4ff151c5d3a64be395c660160b5619e32c7f

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3125

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3125

          I have actually merged this, with slight adjustments so that it takes both the state location and the prior inputs into account. Since batch jobs are so far stateless, this preserves input locality for batch jobs and for a streaming job the first time it is scheduled. For jobs that resume from state, it will try to reuse the prior location.
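
The preference order described in this comment can be sketched as follows. This is only an illustration of the stated rule (prior state location first, input locations otherwise). The class `PreferredLocationsSketch`, its method, and the string identifiers are hypothetical, and the thread does not say how the merged commits actually implement it.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Optional;

// Hypothetical helper; not part of Flink's scheduler code.
final class PreferredLocationsSketch {

    // Returns the locations a new execution attempt should prefer:
    // the prior location if the task resumes from state and one is known,
    // otherwise the locations of its inputs (input locality).
    static Collection<String> preferredLocations(
            boolean resumesFromState,
            Optional<String> priorLocation,
            Collection<String> inputLocations) {
        if (resumesFromState && priorLocation.isPresent()) {
            return Collections.singletonList(priorLocation.get());
        }
        return inputLocations;
    }
}
```

A stateless batch task, or a streaming task scheduled for the first time, falls through to the input locations, which matches the behavior the comment describes.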

          githubbot ASF GitHub Bot added a comment -

          Github user wangzhijiang999 commented on the issue:

          https://github.com/apache/flink/pull/3125

          @StephanEwen
          Yes, the current concern is only state restore performance. This PR does not cover all scenarios, and it may be only the first step of the slot location implementation.

          If no such location exists, other strategies can be added to decide the location, such as co-locating by input for batch jobs, as you mentioned. That can be the second step of the implementation.

          I look forward to your further comments!

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3125

          Thank you for opening this pull request. I think the feature is a good idea, but I would like to approach it a bit more broadly:

          • On state restore, this should prefer the old state location, agreed
          • If no such location exists, it should still try to co-locate by input. Especially for the batch execution, that is quite important.

          Also, this would need some tests.
          I'll add some more detailed comments to the issue soon...

          githubbot ASF GitHub Bot added a comment -

          GitHub user wangzhijiang999 opened a pull request:

          https://github.com/apache/flink/pull/3125

          FLINK-5499[JobManager]Reuse the resource location of prior executio…

          Currently, when an execution is scheduled and requests a slot from the SlotPool, the TaskManagerLocation parameter is an empty collection. In a task failover scenario, the new execution attempt may therefore be deployed to a different TaskManager. When RocksDB is used as the state backend, performance is better if the state can be restored from the local machine. We therefore try to reuse the TaskManagerLocation of the prior execution attempt when allocating a slot from the SlotPool. If no TaskManagerLocation is available from prior executions, the behavior is the same as it is today.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/wangzhijiang999/flink FLINK-5499

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3125.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3125


          commit ab2e24ae7e82be45359f249670f72664226ec18c
          Author: 淘江 <taojiang.wzj@alibaba-inc.com>
          Date: 2017-01-16T09:28:19Z

          FLINK-5499[JobManager]Reuse the resource location of prior execution attempt in allocating slot



            People

            • Assignee: zjwang zhijiang
            • Reporter: zjwang zhijiang
            • Votes: 0
            • Watchers: 4