Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.9.0
-
None
-
Reviewed
Description
Today, on a saturated cluster, apps with pending demand reserve nodes. A new app might not be able to preempt resources because these nodes are already reserved. This can be reproduced by the example in YARN-6151.
Since node reservations are to prevent starvation of apps requesting large containers, triggering these reservations only on starved applications would avoid this situation.
Attachments
Attachments
- YARN-6210.1.patch
- 16 kB
- Karthik Kambatla
- YARN-6210.2.patch
- 26 kB
- Karthik Kambatla
- YARN-6210.3.patch
- 27 kB
- Karthik Kambatla
- YARN-6210.4.patch
- 27 kB
- Karthik Kambatla
Issue Links
Activity
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 16s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
0 | mvndep | 0m 46s | Maven dependency ordering for branch |
+1 | mvninstall | 14m 17s | trunk passed |
+1 | compile | 8m 35s | trunk passed |
+1 | checkstyle | 0m 54s | trunk passed |
+1 | mvnsite | 1m 22s | trunk passed |
+1 | mvneclipse | 0m 42s | trunk passed |
+1 | findbugs | 2m 38s | trunk passed |
+1 | javadoc | 1m 6s | trunk passed |
0 | mvndep | 0m 11s | Maven dependency ordering for patch |
+1 | mvninstall | 1m 7s | the patch passed |
+1 | compile | 6m 59s | the patch passed |
+1 | javac | 6m 59s | the patch passed |
+1 | checkstyle | 0m 59s | the patch passed |
+1 | mvnsite | 1m 35s | the patch passed |
+1 | mvneclipse | 0m 40s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 3m 11s | the patch passed |
-1 | javadoc | 0m 39s | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common generated 13 new + 4575 unchanged - 0 fixed = 4588 total (was 4575) |
+1 | unit | 2m 57s | hadoop-yarn-common in the patch passed. |
-1 | unit | 42m 34s | hadoop-yarn-server-resourcemanager in the patch failed. |
+1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
101m 3s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler | |
hadoop.yarn.server.resourcemanager.TestRMAdminService |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12853639/YARN-6210.1.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 7273cca2149a 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 8035749 |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
javadoc | https://builds.apache.org/job/PreCommit-YARN-Build/15017/artifact/patchprocess/diff-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.txt |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/15017/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15017/testReport/ |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15017/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Patch (v2) fixes:
- The javadoc warnings
- TestFairScheduler tests around reservation. It was very hard to understand the intent of failing tests. Updated the tests based on my understanding.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 24s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
0 | mvndep | 0m 8s | Maven dependency ordering for branch |
+1 | mvninstall | 13m 5s | trunk passed |
+1 | compile | 8m 9s | trunk passed |
+1 | checkstyle | 0m 50s | trunk passed |
+1 | mvnsite | 1m 14s | trunk passed |
+1 | mvneclipse | 0m 39s | trunk passed |
+1 | findbugs | 2m 19s | trunk passed |
+1 | javadoc | 1m 0s | trunk passed |
0 | mvndep | 0m 10s | Maven dependency ordering for patch |
+1 | mvninstall | 0m 58s | the patch passed |
+1 | compile | 6m 30s | the patch passed |
+1 | javac | 6m 30s | the patch passed |
+1 | checkstyle | 0m 49s | the patch passed |
+1 | mvnsite | 1m 12s | the patch passed |
+1 | mvneclipse | 0m 37s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 2m 29s | the patch passed |
+1 | javadoc | 0m 59s | the patch passed |
+1 | unit | 2m 29s | hadoop-yarn-common in the patch passed. |
-1 | unit | 41m 1s | hadoop-yarn-server-resourcemanager in the patch failed. |
+1 | asflicense | 0m 30s | The patch does not generate ASF License warnings. |
94m 4s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12853691/YARN-6210.2.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux 9ff8da4ca724 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 6ba61d2 |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/15021/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15021/testReport/ |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15021/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 12s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
0 | mvndep | 0m 45s | Maven dependency ordering for branch |
+1 | mvninstall | 12m 44s | trunk passed |
+1 | compile | 7m 56s | trunk passed |
+1 | checkstyle | 0m 49s | trunk passed |
+1 | mvnsite | 1m 12s | trunk passed |
+1 | mvneclipse | 0m 37s | trunk passed |
+1 | findbugs | 2m 7s | trunk passed |
+1 | javadoc | 1m 0s | trunk passed |
0 | mvndep | 0m 9s | Maven dependency ordering for patch |
+1 | mvninstall | 0m 56s | the patch passed |
+1 | compile | 6m 3s | the patch passed |
+1 | javac | 6m 3s | the patch passed |
+1 | checkstyle | 0m 48s | the patch passed |
+1 | mvnsite | 1m 9s | the patch passed |
+1 | mvneclipse | 0m 36s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 2m 19s | the patch passed |
+1 | javadoc | 0m 57s | the patch passed |
+1 | unit | 2m 24s | hadoop-yarn-common in the patch passed. |
-1 | unit | 54m 55s | hadoop-yarn-server-resourcemanager in the patch failed. |
+1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
106m 43s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
Timed out junit tests | org.apache.hadoop.yarn.server.resourcemanager.reservation.TestFairSchedulerPlanFollower |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12853691/YARN-6210.2.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux d9dbfd9effa5 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 6ba61d2 |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/15022/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15022/testReport/ |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15022/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Thanks for the patch, kasha. A few comments:
- ResourceCalculator.compare(): singleType needs to be explained better. I still don't know what it means.
- TestFailSchedulerPreemption.writeAllocFile(): should preemptable maybe be preemptable-1?
- FSAppAttempt.isStarvedForFairShare(): No likey the "0 > ..." There's no reason for the weird inverted order, and less than x is easier to logic about.
- FSAppAttempt.isStarved(): looks like it can be private...
- FSAppAttempt.fairShareStarvation(): the meaning of starved changed. It would be nice to document in a comment why it's OK that starved doesn't account for the threshold whereas the return value does.
- TestFairScheduler.testReservationWhileMultiplePriorities(): The comment Create one application and take up all resources isn't actually true, is it?
- TestFairScheduler.testReservationWhileMultiplePriorities(): Looks like you took out the tests that the scheduler resources are as expected at the various steps and the test that the reservation is still for the lower priority request. Not critical, but I would leave them in if it were me.
- TestFairScheduler.testReservationsStrictLocality: I like messages in my asserts.
Thanks for the prompt review, Daniel.
The updated patch incorporates all your suggestions but for the following:
- FSAppAttempt#fairShareStarvation(): I have reverted the meaning of starved. The reason for this is to ensure we don't mark an app whose demand is fully met, but the allocation is under its fairshare. The code also seems simpler and less confusing this way.
- TestFairScheduler.testReservationWithMultiplePriorities(): The reservation-at-lower-priority assert is retained. I did drop the asserts for scheduler resources, but retained the checks around running containers. Since the test is for verifying node reservation behavior, other asserts are misleading.
+1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 10s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
0 | mvndep | 0m 43s | Maven dependency ordering for branch |
+1 | mvninstall | 12m 45s | trunk passed |
+1 | compile | 8m 21s | trunk passed |
+1 | checkstyle | 0m 50s | trunk passed |
+1 | mvnsite | 1m 15s | trunk passed |
+1 | mvneclipse | 0m 38s | trunk passed |
+1 | findbugs | 2m 13s | trunk passed |
+1 | javadoc | 1m 1s | trunk passed |
0 | mvndep | 0m 10s | Maven dependency ordering for patch |
+1 | mvninstall | 1m 3s | the patch passed |
+1 | compile | 6m 51s | the patch passed |
+1 | javac | 6m 51s | the patch passed |
+1 | checkstyle | 0m 51s | the patch passed |
+1 | mvnsite | 1m 20s | the patch passed |
+1 | mvneclipse | 0m 39s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 2m 39s | the patch passed |
+1 | javadoc | 1m 2s | the patch passed |
+1 | unit | 2m 32s | hadoop-yarn-common in the patch passed. |
+1 | unit | 41m 40s | hadoop-yarn-server-resourcemanager in the patch passed. |
+1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
95m 57s |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12853866/YARN-6210.3.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux aea153c517c1 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 003ae00 |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15034/testReport/ |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15034/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
Looks good. A couple tiny nits:
- FSAppAttempt.isStarved(Resource usage, Resource share): probably a bad name. It doesn't have anything to do with starving. It's a comparator.
Nope, the assignment-at-lower-priority assert is retained. The original test also had an assert to test the reservation itself:The reservation-at-lower-priority assert is retained.
// Reserved container should still be at lower priority for (RMContainer container : app.getReservedContainers()) { assertEquals(2, container.getReservedSchedulerKey().getPriority().getPriority()); }
- On the assert messages, please keep in mind what the person triaging the failures will see, e.g. "Basic allocation failed expected:<1> but was:<0>." The message doesn't explain the numbers. Not a major issue, and not something I expect you to fix, but I wanted to point it out.
FSAppAttempt.isStarved(Resource usage, Resource share): probably a bad name. It doesn't have anything to do with starving. It's a comparator.
I agree it is not the best name, but I couldn't come up with a better name that will not be used incorrectly in the future. I am happy to accommodate any valid suggestions.
Nope, the assignment-at-lower-priority assert is retained. The original test also had an assert to test the reservation itself:
This test deals in fewer containers. There is a check for reservation succeeding and subsequently allocation succeeding. Since the behavior changed on when reservations are allowed, the app cannot reserve more than one container.
bq I am happy to accommodate any valid suggestions.
How about isUsageBelowShare()?
+1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 12s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
0 | mvndep | 0m 43s | Maven dependency ordering for branch |
+1 | mvninstall | 12m 35s | trunk passed |
+1 | compile | 8m 43s | trunk passed |
+1 | checkstyle | 0m 49s | trunk passed |
+1 | mvnsite | 1m 13s | trunk passed |
+1 | mvneclipse | 0m 39s | trunk passed |
+1 | findbugs | 2m 5s | trunk passed |
+1 | javadoc | 0m 59s | trunk passed |
0 | mvndep | 0m 10s | Maven dependency ordering for patch |
+1 | mvninstall | 0m 56s | the patch passed |
+1 | compile | 6m 6s | the patch passed |
+1 | javac | 6m 6s | the patch passed |
+1 | checkstyle | 0m 47s | the patch passed |
+1 | mvnsite | 1m 11s | the patch passed |
+1 | mvneclipse | 0m 36s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | findbugs | 2m 31s | the patch passed |
+1 | javadoc | 1m 1s | the patch passed |
+1 | unit | 2m 36s | hadoop-yarn-common in the patch passed. |
+1 | unit | 40m 48s | hadoop-yarn-server-resourcemanager in the patch passed. |
+1 | asflicense | 0m 31s | The patch does not generate ASF License warnings. |
93m 44s |
Subsystem | Report/Notes |
---|---|
Docker | Image:yetus/hadoop:a9ad5d6 |
JIRA Issue | |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12854043/YARN-6210.4.patch |
Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle |
uname | Linux da34fbf21de6 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 1a6ca75 |
Default Java | 1.8.0_121 |
findbugs | v3.0.0 |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15047/testReport/ |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15047/console |
Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |
This message was automatically generated.
I got two +1s from Jenkins. templedf - don't you think that is enough to get a +1 from a committer too?
You're gonna love me for this one...
We now have:
if (!starved) { lastTimeAtFairShare = now; } if (!starved || now - lastTimeAtFairShare < fsQueue.getFairSharePreemptionTimeout()) { fairshareStarvation = Resources.none(); }
What about:
if (!starved || now - lastTimeAtFairShare < fsQueue.getFairSharePreemptionTimeout()) { if (!starved) { lastTimeAtFairShare = now; } fairshareStarvation = Resources.none(); }
?
Just a thought. I'm a +1 either way.
I like what we have better. It is clearer; nested ifs are more confusing because one has to hold the conditions of the previous if in context. Also, I don't see it being more efficient or anything either.
Since you are +1 either way, I ll go ahead and commit this.
Just committed this to trunk and branch-2. Thanks templedf for the careful reviews.
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11291 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11291/)
YARN-6210. FairScheduler: Node reservations can interfere with (kasha: rev 718ad9f6ee93d4145f2bb19b7582ce4e1174feaf)
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java
- (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java
Patch (v1) does the following:
YARN-6151. FYI. yufeigu