Apache Storm / STORM-1766

A better algorithm for server rack selection for RAS

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 2.0.0, 1.1.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, the getBestClustering algorithm for RAS (the Resource Aware Scheduler) picks the "best" cluster/rack based on which rack has the most available resources. This may be insufficient and can prevent topologies from being scheduled even though the cluster has enough resources for them. We attempt to find the rack with the most resources by finding the rack with the largest sum of available memory + available CPU. This method is not effective because it does not consider the number of slots available. It also fails to identify racks that are not schedulable because one of the resources (memory, CPU, or slots) is exhausted. The current implementation also attempts the initial scheduling on a single rack and does not try the remaining racks before giving up, so topologies may fail to be scheduled because of the shortcomings above.

      The current method also does not consider worker failures. When executors of a topology become unassigned and need to be scheduled again, the current logic in getBestClustering may be inadequate, if not completely wrong. When executors need to be rescheduled due to a fault, getBestClustering will likely return a cluster different from the one where the majority of the topology's executors were originally scheduled.

      Thus, I propose a different strategy/algorithm to find the "best" cluster. I have come up with an ordering strategy that I call subordinate resource availability ordering (inspired by Dominant Resource Fairness), which sorts racks by their subordinate (not dominant) resource availability.

      For example, given five racks with the following resource availabilities:

      // a rack with a lot of memory but little CPU
      rack-3 Avail [ CPU 100.0 MEM 200000.0 Slots 40 ] Total [ CPU 100.0 MEM 200000.0 Slots 40 ]
      // a rack of supervisors that are depleted of one resource
      rack-2 Avail [ CPU 0.0 MEM 80000.0 Slots 40 ] Total [ CPU 0.0 MEM 80000.0 Slots 40 ]
      // a rack with a lot of CPU but little memory
      rack-4 Avail [ CPU 6100.0 MEM 10000.0 Slots 40 ] Total [ CPU 6100.0 MEM 10000.0 Slots 40 ]
      // another rack of supervisors with fewer resources than rack-0
      rack-1 Avail [ CPU 2000.0 MEM 40000.0 Slots 40 ] Total [ CPU 2000.0 MEM 40000.0 Slots 40 ]
      rack-0 Avail [ CPU 4000.0 MEM 80000.0 Slots 40 ] Total [ CPU 4000.0 MEM 80000.0 Slots 40 ]
      Cluster Overall Avail [ CPU 12200.0 MEM 410000.0 Slots 200 ] Total [ CPU 12200.0 MEM 410000.0 Slots 200 ]
      

      It is clear that rack-0 is the best cluster, since it is the most balanced and can potentially schedule the most executors, while rack-2 is the worst, since it is depleted of CPU, which renders it unschedulable even though other resources are available.
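To see why the sum-based ranking falls short on this example, here is a minimal sketch in Python. It only illustrates the heuristic as this issue describes it (largest available memory + available CPU first); it is not the actual getBestClustering code, and the names `RACKS` and `rank_by_sum` are made up for the illustration.

```python
# Available (cpu, memory, slots) per rack, copied from the example listing.
RACKS = {
    "rack-0": (4000.0, 80000.0, 40),
    "rack-1": (2000.0, 40000.0, 40),
    "rack-2": (0.0, 80000.0, 40),
    "rack-3": (100.0, 200000.0, 40),
    "rack-4": (6100.0, 10000.0, 40),
}


def rank_by_sum(racks):
    """Order racks by available memory + available CPU, largest first.

    Slots are ignored, and a rack with zero CPU can still rank highly.
    """
    return sorted(
        racks,
        key=lambda name: racks[name][0] + racks[name][1],
        reverse=True,
    )


old_order = rank_by_sum(RACKS)
print(old_order)
```

Under these assumptions the memory-heavy rack-3 comes first even though it has only 100 CPU available, and rack-2, which has no CPU at all, still ranks ahead of rack-1 and rack-4.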

      We first calculate the resource availability percentage of all the racks for each resource by computing:

      (resource available on rack) / (resource available in cluster)
      

      We do this calculation to normalize the values; otherwise, the values for different resources would not be comparable.
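The normalization step can be sketched as follows. This is an illustrative Python version, not the code from the patch; `RACKS` and `availability_fractions` are hypothetical names, and returning 0.0 when the cluster has none of a resource is an assumption made here to avoid dividing by zero.

```python
# Available (cpu, memory, slots) per rack, copied from the example listing.
RACKS = {
    "rack-0": (4000.0, 80000.0, 40),
    "rack-1": (2000.0, 40000.0, 40),
    "rack-2": (0.0, 80000.0, 40),
    "rack-3": (100.0, 200000.0, 40),
    "rack-4": (6100.0, 10000.0, 40),
}


def availability_fractions(racks):
    """Divide each rack's available resource by the cluster-wide available amount."""
    # Cluster-wide availability per resource: (cpu, memory, slots).
    totals = [sum(values[i] for values in racks.values()) for i in range(3)]
    return {
        # Assumption: a resource with no availability anywhere maps to 0.0.
        name: tuple(v / t if t > 0 else 0.0 for v, t in zip(values, totals))
        for name, values in racks.items()
    }


fractions = availability_fractions(RACKS)
for name, (cpu, mem, slots) in sorted(fractions.items()):
    print(f"{name} CPU {cpu:.2%} MEM {mem:.2%} Slots {slots:.2%}")
```

Each value is a fraction between 0 and 1, so CPU, memory, and slots can be compared directly regardless of their original units.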

      So for our example:

      rack-3 Avail [ CPU 0.819672131147541% MEM 48.78048780487805% Slots 20.0% ] effective resources: 0.00819672131147541
      rack-2 Avail [ CPU 0.0% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.0
      rack-4 Avail [ CPU 50.0% MEM 2.4390243902439024% Slots 20.0% ] effective resources: 0.024390243902439025
      rack-1 Avail [ CPU 16.39344262295082% MEM 9.75609756097561% Slots 20.0% ] effective resources: 0.0975609756097561
      rack-0 Avail [ CPU 32.78688524590164% MEM 19.51219512195122% Slots 20.0% ] effective resources: 0.1951219512195122
      

      The effective resource of a rack, which is also the subordinate resource, is computed by:

      MIN(resource availability percentage of {CPU, Memory, # of free Slots}).
      

      Then we order the racks by the effective resource.
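A minimal sketch of the effective-resource computation and the ordering, again in Python and only as an illustration. The input fractions are the example availability percentages rounded to four decimal places; `FRACTIONS`, `effective_resource`, and `order_racks` are hypothetical names, not identifiers from the patch.

```python
# Availability fractions per rack as (cpu, memory, slots), rounded from the
# percentages in the example.
FRACTIONS = {
    "rack-0": (0.3279, 0.1951, 0.20),
    "rack-1": (0.1639, 0.0976, 0.20),
    "rack-2": (0.0, 0.1951, 0.20),
    "rack-3": (0.0082, 0.4878, 0.20),
    "rack-4": (0.50, 0.0244, 0.20),
}


def effective_resource(fractions):
    """The subordinate resource: the scarcest of CPU, memory, and free slots."""
    return min(fractions)


def order_racks(rack_fractions):
    """Sort rack names by effective resource, most available first."""
    return sorted(
        rack_fractions,
        key=lambda name: effective_resource(rack_fractions[name]),
        reverse=True,
    )


sorted_racks = order_racks(FRACTIONS)
print(sorted_racks)
```

Taking the minimum means a rack that has run out of any single resource gets an effective resource of 0 and sorts last, which is the behaviour the sum-based approach lacks.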

      Thus for our example:

      Sorted rack: [rack-0, rack-1, rack-4, rack-3, rack-2]
      

      Also, to deal with failures: if a topology is partially scheduled, we find the rack with the most executors already scheduled for that topology and try to schedule on that rack first.

      Thus, when sorting racks, we first sort by the number of executors already scheduled on the rack and then by the subordinate resource availability.
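The two-level ordering can be sketched as a sort on a composite key. This Python snippet is an illustration under stated assumptions: the effective-resource values are the ones from the example, while the executor counts in `SCHEDULED` are invented to model a partially scheduled topology, and `order_for_topology` is a hypothetical name.

```python
# Effective (subordinate) resource per rack, rounded from the example.
EFFECTIVE = {
    "rack-0": 0.1951,
    "rack-1": 0.0976,
    "rack-2": 0.0,
    "rack-3": 0.0082,
    "rack-4": 0.0244,
}

# Hypothetical: executors of the topology that are still scheduled per rack.
SCHEDULED = {"rack-1": 3, "rack-4": 1}


def order_for_topology(effective, scheduled):
    """Sort by executors already on the rack, then by effective resource."""
    return sorted(
        effective,
        key=lambda name: (scheduled.get(name, 0), effective[name]),
        reverse=True,
    )


rack_order = order_for_topology(EFFECTIVE, SCHEDULED)
print(rack_order)
```

With no executors scheduled anywhere, the first key is 0 for every rack and the result reduces to the plain subordinate resource ordering.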

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user jerrypeng opened a pull request:

          https://github.com/apache/storm/pull/1398

          STORM-1766 - A better algorithm server rack selection for RAS

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/jerrypeng/storm STORM-1766

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/storm/pull/1398.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #1398


          commit 848ef21df184d046ac6617ea2bce1efc00e13a13
          Author: Boyang Jerry Peng <jerrypeng@yahoo-inc.com>
          Date: 2016-05-04T22:08:57Z

          STORM-1766 - A better algorithm server rack selection for RAS


          githubbot ASF GitHub Bot added a comment -

          Github user knusbaum commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-219518790

          +1

          githubbot ASF GitHub Bot added a comment -

          Github user redsanket commented on a diff in the pull request:

          https://github.com/apache/storm/pull/1398#discussion_r64245422

          — Diff: storm-core/src/jvm/org/apache/storm/scheduler/resource/strategies/scheduling/DefaultResourceAwareStrategy.java —
          @@ -45,6 +47,7 @@
          import org.apache.storm.scheduler.WorkerSlot;
          import org.apache.storm.scheduler.resource.Component;

          +
          — End diff –

          extra line?

          githubbot ASF GitHub Bot added a comment -

          Github user jerrypeng commented on a diff in the pull request:

          https://github.com/apache/storm/pull/1398#discussion_r64404386

          — Diff: storm-core/src/jvm/org/apache/storm/scheduler/resource/strategies/scheduling/DefaultResourceAwareStrategy.java —
          @@ -45,6 +47,7 @@
          import org.apache.storm.scheduler.WorkerSlot;
          import org.apache.storm.scheduler.resource.Component;

          +
          — End diff –

          will remove

          githubbot ASF GitHub Bot added a comment -

          Github user jerrypeng commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221314974

          @redsanket thanks for the review. Do you have any other comments?

          githubbot ASF GitHub Bot added a comment -

          Github user ptgoetz commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221383374

          +1

          @jerrypeng Can you file a JIRA for updating the documentation if necessary?

          githubbot ASF GitHub Bot added a comment -

          Github user jerrypeng commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221665618

          @ptgoetz I have created a jira:
          https://issues.apache.org/jira/browse/STORM-1866

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/storm/pull/1398

          githubbot ASF GitHub Bot added a comment -

          Github user ptgoetz commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221677136

          @jerrypeng Did you merge this to any other branches, or just master?

          githubbot ASF GitHub Bot added a comment -

          Github user jerrypeng commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221678273

          @ptgoetz I just merged it into master. Why?

          githubbot ASF GitHub Bot added a comment -

          Github user ptgoetz commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221686373

          @jerrypeng For tracking what goes into each branch/release. Github only gives us merge notifications for the branch a pull request targeted. If you had merged this to other branches, we wouldn't know unless we looked for it in other branches. That's why most of the time we add a comment noting which branches a patch was applied to. It saves a little time for other committers.

          githubbot ASF GitHub Bot added a comment -

          Github user jerrypeng commented on the pull request:

          https://github.com/apache/storm/pull/1398#issuecomment-221686962

          @ptgoetz oh i see, thanks for letting me know! I will remember next time to put a comment in the jira regarding which branches i merged the corresponding PR to.

          githubbot ASF GitHub Bot added a comment -

          Github user HeartSaVioR commented on the issue:

          https://github.com/apache/storm/pull/1398

          @jerrypeng Since the master branch targets 2.0.0 and we don't have a timeframe for that release, it may be better to also add this to 1.1.0 if you think it's not an experimental feature.

          githubbot ASF GitHub Bot added a comment -

          GitHub user jerrypeng opened a pull request:

          https://github.com/apache/storm/pull/1621

          STORM-1766 - A better algorithm server rack selection for RAS

          Backport of #1398 to 1.x branch. I'm not sure this actually needs a PR, but since it's been a while since #1500 was merged, I'll put one anyways since the code went into 2.x a while ago

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/jerrypeng/storm 1.x-STORM-1766

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/storm/pull/1621.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #1621


          commit d50a2437923f2919fbee130b8e9e86a62a2d9f48
          Author: Boyang Jerry Peng <jerrypeng@yahoo-inc.com>
          Date: 2016-05-04T22:08:57Z

          STORM-1766 - A better algorithm server rack selection for RAS


          githubbot ASF GitHub Bot added a comment -

          Github user knusbaum commented on the issue:

          https://github.com/apache/storm/pull/1621

          +1

          githubbot ASF GitHub Bot added a comment -

          Github user harshach commented on a diff in the pull request:

          https://github.com/apache/storm/pull/1621#discussion_r74524961

          — Diff: storm-core/src/jvm/org/apache/storm/scheduler/Cluster.java —
          @@ -103,6 +103,9 @@ public Cluster(Cluster src) {
          this.status.putAll(src.status);
          this.topologyResources.putAll(src.topologyResources);
          this.blackListedHosts.addAll(src.blackListedHosts);
          + if (src.networkTopography != null) {
          — End diff –

          Is this supposed to be == null? Why are we creating a new Map if there is one already?

          githubbot ASF GitHub Bot added a comment -

          Github user HeartSaVioR commented on the issue:

          https://github.com/apache/storm/pull/1621

          +1

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/storm/pull/1621


            People

            • Assignee: Boyang Jerry Peng (jerrypeng)
            • Reporter: Boyang Jerry Peng (jerrypeng)
            • Votes: 0
            • Watchers: 2
