Details

    • Type: Sub-task
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: yarn
    • Labels: None
    • Flags: Patch

    Description

      FSStarvedApps is not thread safe, which may cause one starved app to be processed twice in a row.

      For example, when app1 is fair share starved, it is added to appsToProcess. app1 is then taken from the queue, but appBeingProcessed has not yet been updated to app1. At that moment, app1 becomes min share starved, so it is added to appsToProcess again, because appBeingProcessed is still null and appsToProcess no longer contains it.

      void addStarvedApp(FSAppAttempt app) {
        if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) {
          appsToProcess.add(app);
        }
      }
      
      FSAppAttempt take() throws InterruptedException {
        // Reset appBeingProcessed before the blocking call
        appBeingProcessed = null;
      
        // Blocking call to fetch the next starved application
        FSAppAttempt app = appsToProcess.take();
        appBeingProcessed = app;
        return app;
      }
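To make the window concrete, here is a hypothetical, deliberately simplified sketch (not the actual Hadoop class: app attempts are modeled as plain Strings, and the blocking take() is reduced to a non-blocking poll) in which the duplicate check and the appBeingProcessed update happen under one lock, so the in-between state described above cannot be observed:

```java
import java.util.ArrayDeque;

// Hypothetical sketch, not the real FSStarvedApps: apps are plain Strings
// and the blocking take() is simplified to a non-blocking poll.
class StarvedAppsSketch {
  private final ArrayDeque<String> appsToProcess = new ArrayDeque<>();
  private String appBeingProcessed;

  synchronized void addStarvedApp(String app) {
    // Duplicate check runs under the same lock as take(), so it always sees
    // a consistent pair of (appBeingProcessed, appsToProcess).
    if (!app.equals(appBeingProcessed) && !appsToProcess.contains(app)) {
      appsToProcess.add(app);
    }
  }

  synchronized String take() {
    // Taking the app and recording it as "being processed" is atomic here,
    // so a concurrent addStarvedApp can never observe the moment where the
    // app is neither queued nor marked as being processed.
    String app = appsToProcess.poll(); // null when no starved app is queued
    if (app != null) {
      appBeingProcessed = app;
    }
    return app;
  }

  synchronized void doneProcessing() {
    appBeingProcessed = null;
  }
}
```

With this shape, the scenario from the description (fair share starvation queues app1, app1 is taken, then min share starvation tries to queue it again) adds app1 only once.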
      


      Attachments

        1. YARN-8655.002.patch
          5 kB
          Zhaohui Xin
        2. YARN-8655.patch
          5 kB
          Zhaohui Xin

        Issue Links

          Activity

            genericqa genericqa added a comment -
            +1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 25s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            +1 mvninstall 27m 24s trunk passed
            +1 compile 0m 44s trunk passed
            +1 checkstyle 0m 15s trunk passed
            +1 mvnsite 0m 47s trunk passed
            +1 shadedclient 12m 21s branch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 17s trunk passed
            +1 javadoc 0m 30s trunk passed
                  Patch Compile Tests
            +1 mvninstall 0m 46s the patch passed
            +1 compile 0m 41s the patch passed
            +1 javac 0m 41s the patch passed
            +1 checkstyle 0m 10s the patch passed
            +1 mvnsite 0m 44s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 12m 30s patch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 29s the patch passed
            +1 javadoc 0m 28s the patch passed
                  Other Tests
            +1 unit 70m 27s hadoop-yarn-server-resourcemanager in the patch passed.
            +1 asflicense 0m 24s The patch does not generate ASF License warnings.
            131m 33s



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08
            JIRA Issue YARN-8655
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12935382/YARN-8655.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux a73ca55ef35d 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / a13929d
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_171
            findbugs v3.1.0-RC1
            Test Results https://builds.apache.org/job/PreCommit-YARN-Build/21585/testReport/
            Max. process+thread count 938 (vs. ulimit of 10000)
            modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Console output https://builds.apache.org/job/PreCommit-YARN-Build/21585/console
            Powered by Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            uranus Zhaohui Xin added a comment -

            Hi yufeigu, can you assign this issue to me?

            yufeigu Yufei Gu added a comment -

            Hi uranus, added you to the contributor list and assigned this to you.


            bsteinbach Antal Bálint Steinbach added a comment -

            Hi uranus,

            Thanks for uploading the patch.

            If I understand correctly, the reason to replace take() with poll() is that you don't want to synchronize on a blocking call.

            I have some suggestions about it:

            1) The method name FSStarvedApps.take() is no longer accurate, since it is now a poll, not a take.

            2) The comments in FSStarvedApps still say that take is a blocking method.

            3) The behavior of the class has changed: the thread will now spin in a near-endless loop and take a costly lock on every iteration. It may be worth adding some sleep in the loop, as before.
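One way to keep the loop bounded without an unconditional Thread.sleep() is a timed poll. A hypothetical sketch (queue element type simplified to String; names here are illustrative, not Hadoop's): the thread wakes immediately when an app is queued, and the one-second timeout only applies while the queue is empty, which also gives a natural point to check a shutdown flag.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of suggestion 3: bound the wait with a timed poll
// instead of sleeping unconditionally on every loop iteration.
class TimedPollSketch {
  static final PriorityBlockingQueue<String> appsToProcess =
      new PriorityBlockingQueue<>();

  static String takeWithTimeout() {
    try {
      String app = null;
      while (app == null) {
        // Returns immediately when an app is available; otherwise waits up
        // to one second before returning null.
        app = appsToProcess.poll(1, TimeUnit.SECONDS);
        // A shutdown/interrupt check could run here on each timeout pass.
      }
      return app;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }
}
```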
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 25s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            +1 mvninstall 18m 0s trunk passed
            +1 compile 0m 46s trunk passed
            +1 checkstyle 0m 37s trunk passed
            +1 mvnsite 0m 49s trunk passed
            +1 shadedclient 12m 33s branch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 18s trunk passed
            +1 javadoc 0m 29s trunk passed
                  Patch Compile Tests
            +1 mvninstall 0m 45s the patch passed
            +1 compile 0m 42s the patch passed
            +1 javac 0m 42s the patch passed
            +1 checkstyle 0m 34s the patch passed
            +1 mvnsite 0m 46s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 12m 26s patch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 24s the patch passed
            +1 javadoc 0m 26s the patch passed
                  Other Tests
            -1 unit 92m 57s hadoop-yarn-server-resourcemanager in the patch failed.
            +1 asflicense 0m 24s The patch does not generate ASF License warnings.
            144m 58s



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f
            JIRA Issue YARN-8655
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12935382/YARN-8655.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux ddcf50866132 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / 965d26c
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_191
            findbugs v3.1.0-RC1
            unit https://builds.apache.org/job/PreCommit-YARN-Build/23361/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
            Test Results https://builds.apache.org/job/PreCommit-YARN-Build/23361/testReport/
            Max. process+thread count 957 (vs. ulimit of 10000)
            modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Console output https://builds.apache.org/job/PreCommit-YARN-Build/23361/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            uranus Zhaohui Xin added a comment -

            yufeigu, bsteinbach, I attached a new patch; can you help me review it?

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 16s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            +1 mvninstall 16m 11s trunk passed
            +1 compile 0m 45s trunk passed
            +1 checkstyle 0m 36s trunk passed
            +1 mvnsite 0m 48s trunk passed
            +1 shadedclient 12m 6s branch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 13s trunk passed
            +1 javadoc 0m 31s trunk passed
                  Patch Compile Tests
            +1 mvninstall 0m 41s the patch passed
            +1 compile 0m 38s the patch passed
            +1 javac 0m 38s the patch passed
            +1 checkstyle 0m 30s the patch passed
            +1 mvnsite 0m 41s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 11m 46s patch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 19s the patch passed
            +1 javadoc 0m 27s the patch passed
                  Other Tests
            -1 unit 93m 8s hadoop-yarn-server-resourcemanager in the patch failed.
            +1 asflicense 0m 39s The patch does not generate ASF License warnings.
            142m 0s



            Reason Tests
            Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppStarvation
              hadoop.yarn.server.resourcemanager.scheduler.constraint.TestPlacementConstraintsUtil



            Subsystem Report/Notes
            Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f
            JIRA Issue YARN-8655
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12958177/YARN-8655.002.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 126441f804c4 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / 965d26c
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_191
            findbugs v3.1.0-RC1
            unit https://builds.apache.org/job/PreCommit-YARN-Build/23362/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
            Test Results https://builds.apache.org/job/PreCommit-YARN-Build/23362/testReport/
            Max. process+thread count 956 (vs. ulimit of 10000)
            modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Console output https://builds.apache.org/job/PreCommit-YARN-Build/23362/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.

            wilfreds Wilfred Spiegelenburg added a comment - - edited

            Looking at how we get to adding an application to the starved list I don't think this is a thread safety issue.

            I do agree that we could process the application twice. Fair share starvation and min share starvation are two different things. The queue is starved for min share and the application is starved for fair share. This does not mean that it is a problem. If the application is starved for fair share the calculation of the queue min share starvation already takes that fact into account.

            updateStarvedAppsMinshare() deducts any fair share starvation already processed for applications from the possible min share starvation. This means two things for an application that is marked as min share starved:

            1. the application fair share starvation is less than the distributed min share starvation of the queue
            2. the application has an outstanding demand that is higher than its fair share starvation

            The chance that an application is starved for fair share with a demand that is higher than its fair share starvation combined with the distributed queue minimum share that is higher than the fair share starvation is small.

            It could be worth fixing if it has a high impact. The way you propose to fix it in the patch is, however, not the right one. You introduce a Thread.sleep() call in the pre-emption thread, which is not correct. Currently pre-emption happens as soon as a starved app is added and no pre-emption is in progress. With the change there is only one pre-emption per second. This is a high-impact change, and I think we need to come up with a smarter way to handle this case with less impact on the pre-emption itself.
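The deduction described above can be illustrated with a deliberately simplified, hypothetical sketch (resources are modeled as plain longs; the real code works on Resource objects via Resources.subtract and friends, and the method and parameter names here are illustrative, not Hadoop's):

```java
// Hypothetical sketch of the deduction: the queue's pending min share
// starvation is reduced by the fair share starvation already recorded for
// the app before any extra min share starvation is attributed to it.
class MinShareDeductionSketch {
  // Extra min share starvation attributed to one app.
  static long minShareStarvation(long queuePendingMinShare,
                                 long appFairShareStarvation,
                                 long appPendingDemand) {
    // Fair share starvation already queued for this app counts against the
    // queue's remaining min share gap ...
    long remainingGap =
        Math.max(0, queuePendingMinShare - appFairShareStarvation);
    // ... and the app can only absorb what it still demands beyond its
    // fair share starvation.
    long remainingDemand =
        Math.max(0, appPendingDemand - appFairShareStarvation);
    return Math.min(remainingGap, remainingDemand);
  }
}
```

The result is positive only when both conditions from the list above hold: the app's fair share starvation is below the queue's distributed min share gap, and its outstanding demand exceeds its fair share starvation.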

            uranus Zhaohui Xin added a comment - - edited

            wilfreds Thanks for your reply. I think it's not reasonable to process the application twice, because once we preempt containers for this app, we satisfy both fairshareStarvation and minshareStarvation.

            Resource getStarvation() {
              return Resources.add(fairshareStarvation, minshareStarvation);
            }
            
            wilfreds Wilfred Spiegelenburg added a comment - - edited

            Hi uranus, I am not saying that what we do now is 100% correct. I am only doubting how often this occurs and what the impact on the application and scheduling activities is. Based on the analysis I did I think we need a solution for this case that has far less impact. Do we know any of the following:
            How badly does it affect the running applications, do we pre-empt double what we should?
            Does not handling this correctly slow down pre-emption?
            Is there another impact of not handling the edge case?

            Pre-emption currently runs almost continually and is gated by the take(): when there is a pre-emption waiting we handle it. The patch changes this into one pre-emption per second. It effectively throttles pre-emption down from processing applications as they arrive to a slow scheduled trickle.
            When I look at how we calculate and decide if the application is marked as minimum share starved the cases should be limited. Even if the application is fair share starved and the queue is min share starved we do not automatically mark the application as min share starved. We thus only have this edge case for a small number of applications.
            Fixing that edge case by slowing down all pre-emption handling is what I think is not right.

            uranus Zhaohui Xin added a comment - - edited

            wilfreds, I discovered this problem by accident in our production cluster a few months ago. I think satisfying fair share starvation is enough, so in the end I removed min share starvation to fix this problem.

            I just learned that the community will also abolish min share in the future. Once YARN-9066 is done, this issue will no longer be needed.

            Thanks for your reply. 

            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 43s Docker mode activated.
                  Prechecks
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
                  trunk Compile Tests
            +1 mvninstall 21m 32s trunk passed
            +1 compile 0m 44s trunk passed
            +1 checkstyle 0m 34s trunk passed
            +1 mvnsite 0m 47s trunk passed
            +1 shadedclient 15m 44s branch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 41s trunk passed
            +1 javadoc 0m 30s trunk passed
                  Patch Compile Tests
            +1 mvninstall 0m 44s the patch passed
            +1 compile 0m 39s the patch passed
            +1 javac 0m 39s the patch passed
            +1 checkstyle 0m 29s the patch passed
            +1 mvnsite 0m 43s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 shadedclient 14m 22s patch has no errors when building and testing our client artifacts.
            +1 findbugs 1m 43s the patch passed
            +1 javadoc 0m 26s the patch passed
                  Other Tests
            -1 unit 75m 35s hadoop-yarn-server-resourcemanager in the patch failed.
            0 asflicense 0m 37s ASF License check generated no output?
            137m 26s



            Reason Tests
            Failed junit tests hadoop.yarn.server.resourcemanager.TestRMAdminService
              hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppStarvation



            Subsystem Report/Notes
            Docker Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3
            JIRA Issue YARN-8655
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12958177/YARN-8655.002.patch
            Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
            uname Linux 223731e8bbb0 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/patchprocess/precommit/personality/provided.sh
            git revision trunk / 900430b
            maven version: Apache Maven 3.3.9
            Default Java 1.8.0_242
            findbugs v3.1.0-RC1
            unit https://builds.apache.org/job/PreCommit-YARN-Build/25569/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
            Test Results https://builds.apache.org/job/PreCommit-YARN-Build/25569/testReport/
            Max. process+thread count 814 (vs. ulimit of 5500)
            modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Console output https://builds.apache.org/job/PreCommit-YARN-Build/25569/console
            Powered by Apache Yetus 0.8.0 http://yetus.apache.org

            This message was automatically generated.


            People

              Assignee:
              uranus Zhaohui Xin
              Reporter:
              uranus Zhaohui Xin
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: