SLIDER-799 (sub-task of SLIDER-611 Über-JIRA: placement phase 2; project: Slider)

AM to decide when to relax placement policy from specific host to rack/cluster

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Slider 0.70
    • Fix Version/s: Slider 0.80
    • Component/s: appmaster
    • Labels:
      None
    • Sprint:
      Slider Feb #1, Slider April #1

      Description

      If Slider asks for relaxed affinity, YARN only gives it ~1 second for free capacity to appear on a node before it falls back to non-local assignment. While this is OK for analytics throughput, it's suboptimal for placement of code such as HBase region servers.

      The AM needs to take charge of placement and decide for itself when to convert a request from placed to relaxed.

          Activity

          jira-bot ASF subversion and git services added a comment -

          Commit f5f837cc74d008becac2a663c13753e65e8a32b8 in incubator-slider's branch refs/heads/develop from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=f5f837c ]

          Merge branch 'feature/SLIDER-832Jenkins_failing_afterSLIDER-799_merge' into develop

          jira-bot ASF subversion and git services added a comment -

          Commit 5f397d7da213aa319cec1257e4d8a1c106f1b6d1 in incubator-slider's branch refs/heads/develop from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=5f397d7 ]

          Merge branch 'feature/SLIDER-799-AM-managed-relax' into develop

          jira-bot ASF subversion and git services added a comment -

          Commit b952b6401a22c65053c85a6a1238f1928d3eb243 in incubator-slider's branch refs/heads/feature/SLIDER-799-AM-managed-relax from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=b952b64 ]

          SLIDER-799 SLIDER-817 request tracker builds cancel operation from the resource used in the request...tests updated to handle the changes

          jira-bot ASF subversion and git services added a comment -

          Commit 43c61fbba2ee448f2797790aaaf52ddaaf9ac6f5 in incubator-slider's branch refs/heads/feature/SLIDER-799-AM-managed-relax from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=43c61fb ]

          SLIDER-799 SLIDER-817 track unplaced outstanding requests

          jira-bot ASF subversion and git services added a comment -

          Commit 7d9a9e942ead8343bc4e2c52419c1b258292ca15 in incubator-slider's branch refs/heads/feature/SLIDER-799-AM-managed-relax from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=7d9a9e9 ]

          SLIDER-799 track outcome of allocation: whether an assignment was "open", "placed", or "escalated"; this info is included in serialized/JSON views of container state so can be retrieved by client APIs

          jira-bot ASF subversion and git services added a comment -

          Commit ad41b2444c68a27ab9a5d1a10470da69f74f7bb6 in incubator-slider's branch refs/heads/feature/SLIDER-799-AM-managed-relax from Steve Loughran
          [ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=ad41b24 ]

          SLIDER-799 tests are all working

          stevel@apache.org Steve Loughran added a comment -

          If we are escalating, we could also consider

          1. Having a rack-local escalation before going cluster-wide. This reduces the cost of fetching blocks from the previous host (assuming it is up and has all the blocks local).
          2. Maybe even falling back to other labels/queues. This is trickier and could lead to cluster admins dealing with support problems like 'why is my hbase master not running on a node of a given label?'. I think I'd rather leave the component request unsatisfied and let those admins add new nodes to the label set explicitly.
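
The staged escalation in point 1 could look something like the sketch below. It is illustrative only: `PlacementScope`, `EscalationPolicySketch` and the two timeout parameters are hypothetical names invented for this example, not Slider or YARN APIs. Falling back to other labels (point 2) is deliberately left out.

```java
/** Hypothetical placement scopes, ordered from strictest to most relaxed. */
enum PlacementScope {
    HOST, RACK, CLUSTER
}

/**
 * Hypothetical policy: choose how far a request may be relaxed based on
 * how long it has been outstanding. Not part of Slider.
 */
final class EscalationPolicySketch {
    private final long hostTimeoutMillis;   // how long to insist on the specific host
    private final long rackTimeoutMillis;   // extra time to insist on the same rack

    EscalationPolicySketch(long hostTimeoutMillis, long rackTimeoutMillis) {
        this.hostTimeoutMillis = hostTimeoutMillis;
        this.rackTimeoutMillis = rackTimeoutMillis;
    }

    /** Scope a request of the given age should currently be asking for. */
    PlacementScope scopeFor(long ageMillis) {
        if (ageMillis <= hostTimeoutMillis) {
            return PlacementScope.HOST;
        }
        if (ageMillis <= hostTimeoutMillis + rackTimeoutMillis) {
            return PlacementScope.RACK;
        }
        return PlacementScope.CLUSTER;
    }
}
```

With a host timeout of 30s and a rack timeout of 60s, a request would stay host-specific for the first 30s, rack-local until 90s, and only then go cluster-wide.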
          stevel@apache.org Steve Loughran added a comment -

          Implementation strategy

          1. OutstandingRequest instances add a timestamp and a "requestRelaxed" flag.
          2. The timestamp is set from RoleHistory.now() to make it possible for tests to override it.
          3. OutstandingRequestTracker continues to track requests and continues to remove entries when a request is satisfied on a nominated node.
          4. Only now it can also enumerate all requests that have been outstanding for longer than a specific timeout.
          5. The caller can then "relax" them: cancel the existing request and re-issue it with a relaxed flag (i.e. the alternate YARN priority).
          6. The outstanding request will remain in the queue, only now marked to show how placement has been relaxed.

          This will need some other changes

          • Some heartbeat event to trigger a relaxation scan, cancel outstanding requests and re-issue new ones. This is a bit like review-and-request, except now it's cancel-then-re-request.
          • Enough state needs to be preserved in OutstandingRequest to enable the new request to be rebuilt (e.g. YARN requirements).
          • This could create a new risk of a race condition: an assignment event comes in while/before the new request has been issued.
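
A minimal sketch of the strategy above, under stated assumptions. The classes are simplified stand-ins with hypothetical names (`OutstandingRequestSketch`, `OutstandingRequestTrackerSketch`), not the real Slider classes. The current time is passed in by the caller, which stands in for RoleHistory.now() being overridable in tests. Letting a relaxed request be satisfied by a container on any host is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for an outstanding placed request. */
final class OutstandingRequestSketch {
    final String role;
    final String hostname;       // the nominated node
    final long requestedTime;    // timestamp supplied by the caller so tests control the clock
    boolean requestRelaxed;      // true once the request has been re-issued as relaxed

    OutstandingRequestSketch(String role, String hostname, long requestedTime) {
        this.role = role;
        this.hostname = hostname;
        this.requestedTime = requestedTime;
    }

    /** A request qualifies once it is older than the timeout and not already relaxed. */
    boolean shouldRelax(long now, long timeoutMillis) {
        return !requestRelaxed && now - requestedTime > timeoutMillis;
    }
}

/** Hypothetical, simplified stand-in for the request tracker. */
final class OutstandingRequestTrackerSketch {
    private final List<OutstandingRequestSketch> requests = new ArrayList<>();

    OutstandingRequestSketch addRequest(String role, String hostname, long now) {
        OutstandingRequestSketch request = new OutstandingRequestSketch(role, hostname, now);
        requests.add(request);
        return request;
    }

    /** Remove the request this allocation satisfies; an exact host match is preferred. */
    boolean onContainerAllocated(String role, String hostname) {
        OutstandingRequestSketch match = null;
        for (OutstandingRequestSketch r : requests) {
            if (!r.role.equals(role)) {
                continue;
            }
            if (r.hostname.equals(hostname)) {
                match = r;                  // allocated on the nominated node
                break;
            }
            if (r.requestRelaxed && match == null) {
                match = r;                  // assumption: a relaxed request accepts any host
            }
        }
        return match != null && requests.remove(match);
    }

    /**
     * Mark every timed-out request as relaxed and return them, so the caller
     * can cancel the original YARN request and re-issue a relaxed one.
     * The entries stay in the queue.
     */
    List<OutstandingRequestSketch> relaxTimedOutRequests(long now, long timeoutMillis) {
        List<OutstandingRequestSketch> relaxed = new ArrayList<>();
        for (OutstandingRequestSketch r : requests) {
            if (r.shouldRelax(now, timeoutMillis)) {
                r.requestRelaxed = true;
                relaxed.add(r);
            }
        }
        return relaxed;
    }

    int size() {
        return requests.size();
    }
}
```

A heartbeat handler would call `relaxTimedOutRequests(now, timeout)` and, for each returned entry, cancel the original request and submit the relaxed replacement. The sketch does nothing about the race noted above, where an assignment arrives between the cancel and the re-request.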
          stevel@apache.org Steve Loughran added a comment -

          SLIDER-611 depends on this feature


            People

            • Assignee: Steve Loughran (stevel@apache.org)
            • Reporter: Steve Loughran (stevel@apache.org)
            • Votes: 0
            • Watchers: 5


                Time Tracking

                • Original Estimate: 24h
                • Remaining Estimate: 8h
                • Time Spent: 24h
