Details
- Type: Sub-task
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
- Labels: Patch
Description
FSStarvedApps is not thread safe, which may cause a starved app to be processed twice in a row.
For example, when app1 is fair share starved, it is added to appsToProcess. Later, app1 is taken from the queue, but appBeingProcessed has not yet been updated to app1. At that moment, app1 also becomes min share starved, so it is added to appsToProcess again, because appBeingProcessed is still null and appsToProcess no longer contains it.
```java
void addStarvedApp(FSAppAttempt app) {
  if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) {
    appsToProcess.add(app);
  }
}

FSAppAttempt take() throws InterruptedException {
  // Reset appBeingProcessed before the blocking call
  appBeingProcessed = null;

  // Blocking call to fetch the next starved application
  FSAppAttempt app = appsToProcess.take();
  appBeingProcessed = app;
  return app;
}
```
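One way to close that window, roughly in line with the later discussion of replacing take() with poll(), is to guard the duplicate check and the hand-off of appBeingProcessed with the same lock; a blocking take() cannot be held under that lock, so the fetch becomes a non-blocking poll(). The sketch below is only an illustration of that idea, not the actual patch; it reuses the field names from the snippet above.

```java
// Illustrative sketch only, not the actual patch: synchronizing both methods
// means an app cannot be re-added between being removed from appsToProcess
// and appBeingProcessed being updated.
synchronized void addStarvedApp(FSAppAttempt app) {
  if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) {
    appsToProcess.add(app);
  }
}

// Non-blocking replacement for take(): returns null when no starved app is
// queued, leaving the wait/retry policy to the caller.
synchronized FSAppAttempt poll() {
  FSAppAttempt app = appsToProcess.poll();
  appBeingProcessed = app;
  return app;
}
```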
Attachments
- YARN-8655.002.patch (5 kB, Zhaohui Xin)
- YARN-8655.patch (5 kB, Zhaohui Xin)
Issue Links
Activity
Hi uranus,
Thanks for uploading the patch.
If I understand correctly, the reason to replace take() with poll() is that you don't want to synchronize on a blocking call.
I have some suggestions about it:
1) The method name FSStarvedApps.take() is no longer accurate, since it now performs a poll rather than a take.
2) The comments in FSStarvedApps still say that take() is a blocking method.
3) The behavior of the class has changed: with your solution the thread effectively runs in a busy loop and acquires a costly lock on every iteration. It might be worth adding some sleep in the loop, just like before (see the sketch below).
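For illustration only, a caller-side loop along the lines of that suggestion might look like the sketch below. The names running, starvedApps, preemptContainers() and the sleep interval are assumptions, not code from the patch.

```java
// Rough sketch of the reviewer's suggestion, not the actual patch: poll the
// starved-apps structure and back off briefly when it is empty, instead of
// spinning and re-acquiring the lock on every iteration.
void processStarvedApps() throws InterruptedException {
  while (running) {
    FSAppAttempt starvedApp = starvedApps.poll(); // non-blocking, may return null
    if (starvedApp != null) {
      preemptContainers(starvedApp);              // hypothetical pre-emption handler
    } else {
      Thread.sleep(10);                           // assumed back-off; avoids a busy loop
    }
  }
}
```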
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 25s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 18m 0s | trunk passed |
+1 | compile | 0m 46s | trunk passed |
+1 | checkstyle | 0m 37s | trunk passed |
+1 | mvnsite | 0m 49s | trunk passed |
+1 | shadedclient | 12m 33s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 18s | trunk passed |
+1 | javadoc | 0m 29s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 45s | the patch passed |
+1 | compile | 0m 42s | the patch passed |
+1 | javac | 0m 42s | the patch passed |
+1 | checkstyle | 0m 34s | the patch passed |
+1 | mvnsite | 0m 46s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 12m 26s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 24s | the patch passed |
+1 | javadoc | 0m 26s | the patch passed |
Other Tests | |||
-1 | unit | 92m 57s | hadoop-yarn-server-resourcemanager in the patch failed. |
+1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
144m 58s |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
JIRA Issue | YARN-8655 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935382/YARN-8655.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux ddcf50866132 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 965d26c |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_191 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/23361/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23361/testReport/ |
Max. process+thread count | 957 (vs. ulimit of 10000) |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23361/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
yufeigu, bsteinbach, I attached a new patch, can you help me review it?
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 16s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 16m 11s | trunk passed |
+1 | compile | 0m 45s | trunk passed |
+1 | checkstyle | 0m 36s | trunk passed |
+1 | mvnsite | 0m 48s | trunk passed |
+1 | shadedclient | 12m 6s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 13s | trunk passed |
+1 | javadoc | 0m 31s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 41s | the patch passed |
+1 | compile | 0m 38s | the patch passed |
+1 | javac | 0m 38s | the patch passed |
+1 | checkstyle | 0m 30s | the patch passed |
+1 | mvnsite | 0m 41s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 11m 46s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 19s | the patch passed |
+1 | javadoc | 0m 27s | the patch passed |
Other Tests | |||
-1 | unit | 93m 8s | hadoop-yarn-server-resourcemanager in the patch failed. |
+1 | asflicense | 0m 39s | The patch does not generate ASF License warnings. |
142m 0s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppStarvation |
hadoop.yarn.server.resourcemanager.scheduler.constraint.TestPlacementConstraintsUtil |
Subsystem | Report/Notes |
---|---|
Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
JIRA Issue | YARN-8655 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958177/YARN-8655.002.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 126441f804c4 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 965d26c |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_191 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/23362/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23362/testReport/ |
Max. process+thread count | 956 (vs. ulimit of 10000) |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23362/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
Looking at how we get to adding an application to the starved list, I don't think this is a thread safety issue.
I do agree that we could process the application twice. Fair share starvation and min share starvation are two different things: the queue is starved for min share while the application is starved for fair share. That does not automatically make it a problem, because if the application is starved for fair share, the calculation of the queue's min share starvation already takes that into account.
The updateStarvedAppsMinshare() method deducts any fair share starvation already processed for applications from the possible min share starvation (a simplified sketch follows the list below). This means two things for an application that is marked as min share starved:
- the application's fair share starvation is less than the distributed min share starvation of the queue
- the application has an outstanding demand that is higher than its fair share starvation
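As a rough illustration of that deduction (not the real updateStarvedAppsMinshare() code), the distribution might look like the sketch below. The Resources helper calls (clone, subtractNonNegative, componentwiseMin, isNone) and the app-side getters and setters are assumptions about the surrounding API.

```java
// Simplified sketch, not the real Hadoop code: distribute the queue's min
// share starvation over its apps, after deducting the fair share starvation
// that has already been recorded for each app.
Resource pending = Resources.clone(queueMinShareStarvation);     // assumed input
for (FSAppAttempt app : appsWithDemand) {                        // assumed iteration order
  if (Resources.isNone(pending)) {
    break;                                                       // nothing left to hand out
  }
  // Only demand not already covered by fair share starvation can count as
  // min share starvation for this app (floored at zero; assumed helper).
  Resource appShare = Resources.subtractNonNegative(
      app.getPendingDemand(), app.getFairshareStarvation());
  appShare = Resources.componentwiseMin(appShare, pending);
  if (!Resources.isNone(appShare)) {
    app.setMinshareStarvation(appShare);                         // app is now min share starved
    pending = Resources.subtractNonNegative(pending, appShare);
  }
}
```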
The chance that an application is fair share starved, has a demand higher than its fair share starvation, and also receives a distributed queue min share that exceeds that fair share starvation is small.
It could be worth fixing if it has a high impact. The way you propose to fix it in the patch, however, is not the right approach: you introduce a Thread.sleep() call in the pre-emption thread, which is not correct. Currently pre-emption happens as soon as a starved app is added and no pre-emption is in progress; with the change there is only one pre-emption per second. That is a high-impact change, and I think we need to come up with a smarter way to handle this case with less impact on pre-emption itself.
wilfreds, thanks for your reply. I don't think it is reasonable to process the application twice, because once we preempt containers for this app we satisfy both fairshareStarvation and minshareStarvation:
```java
Resource getStarvation() {
  return Resources.add(fairshareStarvation, minshareStarvation);
}
```
Hi uranus, I am not saying that what we do now is 100% correct. I am only questioning how often this occurs and what the impact on the application and on scheduling activities is. Based on the analysis I did, I think we need a solution for this case that has far less impact. Do we know any of the following:
- How badly does it affect the running applications? Do we pre-empt double what we should?
- Does not handling this correctly slow down pre-emption?
- Is there any other impact from not handling the edge case?
Pre-emption currently runs almost continually and is gated by the take(): when there is a pre-emption waiting, we handle it. The patch changes this into one pre-emption per second, effectively throttling pre-emption from processing applications as they arrive down to a slow, scheduled trickle.
When I look at how we calculate and decide whether an application is marked as min share starved, the cases should be limited. Even if the application is fair share starved and the queue is min share starved, we do not automatically mark the application as min share starved. We thus only hit this edge case for a small number of applications.
Fixing that edge case by slowing down all pre-emption handling is, in my view, not the right approach.
wilfreds, I stumbled on this problem in our production cluster a few months ago. I think satisfying fair share starvation is enough, so in the end I removed min share starvation to fix this problem.
I just learned that the community plans to abolish min share in the future. After YARN-9066, this issue will no longer be needed.
Thanks for your reply.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 43s | Docker mode activated. |
Prechecks | |||
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
trunk Compile Tests | |||
+1 | mvninstall | 21m 32s | trunk passed |
+1 | compile | 0m 44s | trunk passed |
+1 | checkstyle | 0m 34s | trunk passed |
+1 | mvnsite | 0m 47s | trunk passed |
+1 | shadedclient | 15m 44s | branch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 41s | trunk passed |
+1 | javadoc | 0m 30s | trunk passed |
Patch Compile Tests | |||
+1 | mvninstall | 0m 44s | the patch passed |
+1 | compile | 0m 39s | the patch passed |
+1 | javac | 0m 39s | the patch passed |
+1 | checkstyle | 0m 29s | the patch passed |
+1 | mvnsite | 0m 43s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | shadedclient | 14m 22s | patch has no errors when building and testing our client artifacts. |
+1 | findbugs | 1m 43s | the patch passed |
+1 | javadoc | 0m 26s | the patch passed |
Other Tests | |||
-1 | unit | 75m 35s | hadoop-yarn-server-resourcemanager in the patch passed. |
0 | asflicense | 0m 37s | ASF License check generated no output? |
137m 26s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMAdminService |
hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppStarvation |
Subsystem | Report/Notes |
---|---|
Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
JIRA Issue | YARN-8655 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958177/YARN-8655.002.patch |
Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
uname | Linux 223731e8bbb0 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/provided.sh |
git revision | trunk / 900430b |
maven | version: Apache Maven 3.3.9 |
Default Java | 1.8.0_242 |
findbugs | v3.1.0-RC1 |
unit | https://builds.apache.org/job/PreCommit-YARN-Build/25569/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/25569/testReport/ |
Max. process+thread count | 814 (vs. ulimit of 5500) |
modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
Console output | https://builds.apache.org/job/PreCommit-YARN-Build/25569/console |
Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.