Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-7057

FSAppAttempt#getResourceUsage doesn't need to consider resources queued for preemption

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.0
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 3.1.0
    • Component/s: fairscheduler
    • Labels:
      None

      Description

      FSAppAttempt#getResourceUsage excludes resources that are currently allocated to the app but are about to be preempted. This inconsistency shows in the UI and can affect scheduling of containers.

      1. YARN-7057.001.patch
        2 kB
        Karthik Kambatla
      2. YARN-7057.002.patch
        2 kB
        Yufei Gu

        Activity

        Hide
        kasha Karthik Kambatla added a comment -

        Patch simplifies getResourceUsage and updates canContainerBePreempted to consider resources already queued for preemption on this app.

        Show
        kasha Karthik Kambatla added a comment - Patch simplifies getResourceUsage and updates canContainerBePreempted to consider resources already queued for preemption on this app.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 19s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
              trunk Compile Tests
        +1 mvninstall 15m 27s trunk passed
        +1 compile 0m 39s trunk passed
        +1 checkstyle 0m 26s trunk passed
        +1 mvnsite 0m 38s trunk passed
        +1 findbugs 1m 13s trunk passed
        +1 javadoc 0m 24s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 35s the patch passed
        +1 compile 0m 35s the patch passed
        +1 javac 0m 35s the patch passed
        +1 checkstyle 0m 24s the patch passed
        +1 mvnsite 0m 40s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 13s the patch passed
        +1 javadoc 0m 21s the patch passed
              Other Tests
        -1 unit 44m 51s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 14s The patch does not generate ASF License warnings.
        69m 18s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:14b5c93
        JIRA Issue YARN-7057
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882757/YARN-7057.001.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 0f6334c7e31b 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 436c263
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        unit https://builds.apache.org/job/PreCommit-YARN-Build/17017/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17017/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/17017/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 19s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.       trunk Compile Tests +1 mvninstall 15m 27s trunk passed +1 compile 0m 39s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 38s trunk passed +1 findbugs 1m 13s trunk passed +1 javadoc 0m 24s trunk passed       Patch Compile Tests +1 mvninstall 0m 35s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed +1 checkstyle 0m 24s the patch passed +1 mvnsite 0m 40s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 13s the patch passed +1 javadoc 0m 21s the patch passed       Other Tests -1 unit 44m 51s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 14s The patch does not generate ASF License warnings. 69m 18s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue YARN-7057 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882757/YARN-7057.001.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 0f6334c7e31b 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 436c263 Default Java 1.8.0_144 findbugs v3.1.0-RC1 unit https://builds.apache.org/job/PreCommit-YARN-Build/17017/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17017/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/17017/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        kasha Karthik Kambatla added a comment -

        The test failure in CapacityScheduler should be unrelated.

        Existing tests should verify both the behavior of getResourceUsage and canContainerBePreempted. Hence, no new tests.

        Show
        kasha Karthik Kambatla added a comment - The test failure in CapacityScheduler should be unrelated. Existing tests should verify both the behavior of getResourceUsage and canContainerBePreempted. Hence, no new tests.
        Hide
        yufeigu Yufei Gu added a comment -

        +1

        Show
        yufeigu Yufei Gu added a comment - +1
        Hide
        templedf Daniel Templeton added a comment -

        Yep, capacity scheduler failure is unrelated. Looks like a generally sane change to me. Quick question, though: why the change to getResourceUsage()? That change would seem to have a pretty large impact. For example, also in FSAppAttempt:

          Resource getPendingDemand() {
            return Resources.subtract(demand, getResourceUsage());
          }

        Are we sure that change isn't going to cause problems?

        Show
        templedf Daniel Templeton added a comment - Yep, capacity scheduler failure is unrelated. Looks like a generally sane change to me. Quick question, though: why the change to getResourceUsage() ? That change would seem to have a pretty large impact. For example, also in FSAppAttempt : Resource getPendingDemand() { return Resources.subtract(demand, getResourceUsage()); } Are we sure that change isn't going to cause problems?
        Hide
        yufeigu Yufei Gu added a comment -

        Looking around the code, getPendingDemand() shouldn't be affected in a bad way, it will actually be affect in a good way. One of callers of getPendingDemand() decides which application is starve. Without this change, updateStarvedApps looks fishy.

        Show
        yufeigu Yufei Gu added a comment - Looking around the code, getPendingDemand() shouldn't be affected in a bad way, it will actually be affect in a good way. One of callers of getPendingDemand() decides which application is starve. Without this change, updateStarvedApps looks fishy.
        Hide
        templedf Daniel Templeton added a comment -

        There are a lot of consumers of getResourceUsage() though one channel or another. Have we exhaustively checked to make sure none are negatively impacted? Passing unit tests isn't proof enough.

        Show
        templedf Daniel Templeton added a comment - There are a lot of consumers of getResourceUsage() though one channel or another. Have we exhaustively checked to make sure none are negatively impacted? Passing unit tests isn't proof enough.
        Hide
        kasha Karthik Kambatla added a comment -

        Daniel Templeton - good questions.

        Reasons why getResourceUsage() needs change:

        1. Subtracting resources queued for preemption before returning current allocation is plain incorrect, both to users who see this information through UI and other calculations in the scheduler. It is possible that the application gives up another container, and the current container queued for preemption is never preempted. Or, more resources are added to the cluster. Basically, I just feel we shouldn't count our chicken before they hatch.
        2. Acquiring the lock is an unnecessary overhead all calls to getResourceUsage incur. This gets called, recursively, so frequently that this can easily add up. Remember the frequency of calls is the reason we added an if-check in the first place.

        I did manually check exhaustively all places getResourceUsage is called. In fact, my first patch (have not posted) was to have two methods getResourceUsage and getEffectiveResourceUsage. While working on that version, I carefully considered every call and realized there is only one place calling getEffectiveResourceUsage. That is when it dawned on me that I could just change that one caller. That said, I do appreciate another pair of eyes validate my claim.

        Show
        kasha Karthik Kambatla added a comment - Daniel Templeton - good questions. Reasons why getResourceUsage() needs change: Subtracting resources queued for preemption before returning current allocation is plain incorrect, both to users who see this information through UI and other calculations in the scheduler. It is possible that the application gives up another container, and the current container queued for preemption is never preempted. Or, more resources are added to the cluster. Basically, I just feel we shouldn't count our chicken before they hatch. Acquiring the lock is an unnecessary overhead all calls to getResourceUsage incur. This gets called, recursively, so frequently that this can easily add up. Remember the frequency of calls is the reason we added an if-check in the first place. I did manually check exhaustively all places getResourceUsage is called. In fact, my first patch (have not posted) was to have two methods getResourceUsage and getEffectiveResourceUsage. While working on that version, I carefully considered every call and realized there is only one place calling getEffectiveResourceUsage. That is when it dawned on me that I could just change that one caller. That said, I do appreciate another pair of eyes validate my claim.
        Hide
        templedf Daniel Templeton added a comment -

        OK, I double-check the uses of getResourceUsage(), and I agree. The change ranges from benign to necessary, depending on the caller. The only one that's a little suspicious is isStarvedForFairShare(), where it seems we might now be penalizing the app for the to-be-preempted resources. Yufei Gu, if you're happy with it, I think it's OK to commit.

        Show
        templedf Daniel Templeton added a comment - OK, I double-check the uses of getResourceUsage() , and I agree. The change ranges from benign to necessary, depending on the caller. The only one that's a little suspicious is isStarvedForFairShare() , where it seems we might now be penalizing the app for the to-be-preempted resources. Yufei Gu , if you're happy with it, I think it's OK to commit.
        Hide
        yufeigu Yufei Gu added a comment -

        Thanks Daniel Templeton for the review. Karthik Kambatla, this patch doesn't apply to trunk. Can you rebase? or I can do the rebase if you are OK with that.

        Show
        yufeigu Yufei Gu added a comment - Thanks Daniel Templeton for the review. Karthik Kambatla , this patch doesn't apply to trunk. Can you rebase? or I can do the rebase if you are OK with that.
        Hide
        yufeigu Yufei Gu added a comment -

        Rebase with patch v2.

        Show
        yufeigu Yufei Gu added a comment - Rebase with patch v2.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 22s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
              trunk Compile Tests
        +1 mvninstall 17m 21s trunk passed
        +1 compile 0m 37s trunk passed
        +1 checkstyle 0m 26s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 findbugs 1m 0s trunk passed
        +1 javadoc 0m 21s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 33s the patch passed
        +1 compile 0m 32s the patch passed
        +1 javac 0m 32s the patch passed
        +1 checkstyle 0m 23s the patch passed
        +1 mvnsite 0m 34s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 6s the patch passed
        +1 javadoc 0m 19s the patch passed
              Other Tests
        -1 unit 44m 22s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        70m 5s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
          hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
          hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:71bbb86
        JIRA Issue YARN-7057
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12886728/YARN-7057.002.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux a58ccdecbf9a 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 86f4d1c
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        unit https://builds.apache.org/job/PreCommit-YARN-Build/17425/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17425/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/17425/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 22s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.       trunk Compile Tests +1 mvninstall 17m 21s trunk passed +1 compile 0m 37s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 39s trunk passed +1 findbugs 1m 0s trunk passed +1 javadoc 0m 21s trunk passed       Patch Compile Tests +1 mvninstall 0m 33s the patch passed +1 compile 0m 32s the patch passed +1 javac 0m 32s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 34s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 19s the patch passed       Other Tests -1 unit 44m 22s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 70m 5s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation   hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler   hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy Subsystem Report/Notes Docker Image:yetus/hadoop:71bbb86 JIRA Issue YARN-7057 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12886728/YARN-7057.002.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux a58ccdecbf9a 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 86f4d1c Default Java 1.8.0_144 findbugs v3.1.0-RC1 unit https://builds.apache.org/job/PreCommit-YARN-Build/17425/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17425/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/17425/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        yufeigu Yufei Gu added a comment -

        Committed to trunk and branch-2. Thanks for the patch, Karthik Kambatla. Thanks for the review, Daniel Templeton.

        Show
        yufeigu Yufei Gu added a comment - Committed to trunk and branch-2. Thanks for the patch, Karthik Kambatla . Thanks for the review, Daniel Templeton .
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12854 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12854/)
        YARN-7057. FSAppAttempt#getResourceUsage doesn't need to consider (yufei: rev 82c5dd1d508292ed88eda0f5356776437ba67d4c)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12854 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12854/ ) YARN-7057 . FSAppAttempt#getResourceUsage doesn't need to consider (yufei: rev 82c5dd1d508292ed88eda0f5356776437ba67d4c) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        Hide
        yufeigu Yufei Gu added a comment -

        Committed to branch-3.0.

        Show
        yufeigu Yufei Gu added a comment - Committed to branch-3.0.

          People

          • Assignee:
            kasha Karthik Kambatla
            Reporter:
            kasha Karthik Kambatla
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development