Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-4743

FairSharePolicy breaks TimSort assumption

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: fairscheduler
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
      java.lang.IllegalArgumentException: Comparison method violates its general contract!
               at java.util.TimSort.mergeHi(TimSort.java:868)
               at java.util.TimSort.mergeAt(TimSort.java:485)
               at java.util.TimSort.mergeCollapse(TimSort.java:410)
               at java.util.TimSort.sort(TimSort.java:214)
               at java.util.TimSort.sort(TimSort.java:173)
               at java.util.Arrays.sort(Arrays.java:659)
               at java.util.Collections.sort(Collections.java:217)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
               at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
               at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
               at java.lang.Thread.run(Thread.java:745)
      2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
      

      Actually, this bug found in 2.6.0-cdh5.4.7. FairShareComparator is not transitive.

      We get NaN when memorySize=0 and weight=0.

      FairSharePolicy.java
      useToWeightRatio1 = s1.getResourceUsage().getMemorySize() /
        s1.getWeights().getWeight(ResourceType.MEMORY)
      
      1. timsort.log
        94 kB
        Zephyr Guo
      2. YARN-4743-v1.patch
        9 kB
        Zephyr Guo
      3. YARN-4743-v2.patch
        10 kB
        Zephyr Guo
      4. YARN-4743-v3.patch
        12 kB
        Zephyr Guo
      5. YARN-4743-v4.patch
        12 kB
        Zephyr Guo
      6. YARN-4743-v5.patch
        12 kB
        Yufei Gu

        Issue Links

          Activity

          Hide
          ozawa Tsuyoshi Ozawa added a comment -

          Zephyr Guo thank you for the report. IIUC, the comparator must ensure that the relation be transitive.

          https://docs.oracle.com/javase/7/docs/api/java/lang/Comparable.html

          I think that DRF comparator is not transitive with my intuition. Karthik Kambatla, what do you think? Can we design the comparator as transitive comparator?

          Show
          ozawa Tsuyoshi Ozawa added a comment - Zephyr Guo thank you for the report. IIUC, the comparator must ensure that the relation be transitive. https://docs.oracle.com/javase/7/docs/api/java/lang/Comparable.html I think that DRF comparator is not transitive with my intuition. Karthik Kambatla , what do you think? Can we design the comparator as transitive comparator?
          Hide
          gzh1992n Zephyr Guo added a comment -

          I think that DRF comparator is not transitive with my intuition.

          I think that's right.Tsuyoshi Ozawa

          FairShareComparator uses getResourceUsage() and getDemand() and getMinShare() to implement compare(Schedulable s1, Schedulable s1).The three methods must return same Resource anyway while we are sorting, otherwise will break transitivity.

          How about add snapshot feature in Schedulable? We snapshot Schedulable before sorting.Then we sort but use snapshot Resource in comparator . Result of sorting will very close to real situation, because sorting is very fast.

          Show
          gzh1992n Zephyr Guo added a comment - I think that DRF comparator is not transitive with my intuition. I think that's right. Tsuyoshi Ozawa FairShareComparator uses getResourceUsage() and getDemand() and getMinShare() to implement compare(Schedulable s1, Schedulable s1) .The three methods must return same Resource anyway while we are sorting, otherwise will break transitivity. How about add snapshot feature in Schedulable? We snapshot Schedulable before sorting.Then we sort but use snapshot Resource in comparator . Result of sorting will very close to real situation, because sorting is very fast.
          Hide
          kasha Karthik Kambatla added a comment -

          Zephyr Guo - thanks for reporting and working on this. I haven't had a chance to look at it closely enough. Will take me a couple of days to do so.

          On the surface, it seems benign to sort a snapshot of Schedulables. The other way would be to use ReadWriteLock in FSQueue: getters would all try to get a readLock while the sort holds the write lock?

          Show
          kasha Karthik Kambatla added a comment - Zephyr Guo - thanks for reporting and working on this. I haven't had a chance to look at it closely enough. Will take me a couple of days to do so. On the surface, it seems benign to sort a snapshot of Schedulables. The other way would be to use ReadWriteLock in FSQueue: getters would all try to get a readLock while the sort holds the write lock?
          Hide
          gzh1992n Zephyr Guo added a comment -

          I am trying to solve the issue, but I am failed.
          In my opinion, the issue cause by concurrent operation on FSAppAttempt.When FSLeafQueue is sorting FSAppAttempt, the inner Resource of FsAppAttempt is modified.In this case, FairShareComparator may cannot work correctly.Base on this idea, I write YARN-4743-cdh5.4.7.patch(I have attached).The patch use snapshot to protect elements during the sorting.Sadly, this problem doesn't resolve with the patch.I got same exception on sorting and more frequently crash.I begin to doubt whether the comparator have a problem really.I reviewed FairShareComparator code and simulate all cases, but did not found any bugs.

          I need some idea. I'd like to verify two things.1)Can inner Resource be modified during the sorting?Who could review it for me? 2)Does comparator also have mistakes really or my patch is incorrect?

          I doubt that float-point precision in comparator, but it's hard to reappear in test cluster(never reappear). It happen with low probability in larger cluster.

          Show
          gzh1992n Zephyr Guo added a comment - I am trying to solve the issue, but I am failed. In my opinion, the issue cause by concurrent operation on FSAppAttempt .When FSLeafQueue is sorting FSAppAttempt, the inner Resource of FsAppAttempt is modified.In this case, FairShareComparator may cannot work correctly.Base on this idea, I write YARN-4743 -cdh5.4.7.patch(I have attached).The patch use snapshot to protect elements during the sorting.Sadly, this problem doesn't resolve with the patch.I got same exception on sorting and more frequently crash.I begin to doubt whether the comparator have a problem really.I reviewed FairShareComparator code and simulate all cases, but did not found any bugs. I need some idea. I'd like to verify two things.1)Can inner Resource be modified during the sorting?Who could review it for me? 2)Does comparator also have mistakes really or my patch is incorrect? I doubt that float-point precision in comparator, but it's hard to reappear in test cluster(never reappear). It happen with low probability in larger cluster.
          Hide
          sandflee sandflee added a comment - - edited

          I don't think snapshot could resolve this, as in YARN-5371, node is only sorted with unused resource. this seems caused by a > b, and b > c, but while sorting a and c, a < c. we should snapshot all sorting element and then sort to avoid this, or could add -Djava.util.Arrays.useLegacyMergeSort=true to YARN_OPS to use mergeSort not TimSort for Collection#sort.

          Show
          sandflee sandflee added a comment - - edited I don't think snapshot could resolve this, as in YARN-5371 , node is only sorted with unused resource. this seems caused by a > b, and b > c, but while sorting a and c, a < c. we should snapshot all sorting element and then sort to avoid this, or could add -Djava.util.Arrays.useLegacyMergeSort=true to YARN_OPS to use mergeSort not TimSort for Collection#sort.
          Hide
          xietingwen Xie Tingwen added a comment -

          sandflee I also met this problem when sort NodeIdList in FairScheduler#continuousSchedulingAttempt(),
          I think the root cause is NodeAvailableResourceComparator is not transitive.

          Show
          xietingwen Xie Tingwen added a comment - sandflee I also met this problem when sort NodeIdList in FairScheduler#continuousSchedulingAttempt(), I think the root cause is NodeAvailableResourceComparator is not transitive.
          Hide
          sandflee sandflee added a comment -

          I think the root cause is NodeAvailableResourceComparator is not transitive.

          agree, but do we really need comparator to be transitive? it may reduce the perform greatly.

          Show
          sandflee sandflee added a comment - I think the root cause is NodeAvailableResourceComparator is not transitive. agree, but do we really need comparator to be transitive? it may reduce the perform greatly.
          Hide
          benedict jin Benedict Jin added a comment - - edited

          感觉这是一个 jdk本身的漏洞,比较器里面 相比较的两个值 如果同时为空的话,传入的顺序可能决定了返回值 的结果,破坏了 "传递性"

          JDK-6804124 : (coll) Replace "modified mergesort" in java.util.Arrays.sort with timsort
          http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6804124

          可以在 jvm中配置 java.util.Arrays.useLegacyMergeSort=true

          [或者在程序中 System.setProperty("java.util.Arrays.useLegacyMergeSort", "true")]

          Show
          benedict jin Benedict Jin added a comment - - edited 感觉这是一个 jdk本身的漏洞,比较器里面 相比较的两个值 如果同时为空的话,传入的顺序可能决定了返回值 的结果,破坏了 "传递性" JDK-6804124 : (coll) Replace "modified mergesort" in java.util.Arrays.sort with timsort http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6804124 可以在 jvm中配置 java.util.Arrays.useLegacyMergeSort=true [或者在程序中 System.setProperty("java.util.Arrays.useLegacyMergeSort", "true")]
          Hide
          yufeigu Yufei Gu added a comment -

          Thanks Benedict Jin's valuable information. Nice catch, I will try it later. Could you please rewrite it in English? So that more people can understand.

          Show
          yufeigu Yufei Gu added a comment - Thanks Benedict Jin 's valuable information. Nice catch, I will try it later. Could you please rewrite it in English? So that more people can understand.
          Hide
          yufeigu Yufei Gu added a comment -

          We don't sort running apps anymore according to YARN-3547, and the code invoking the TimSort() has been removed. For that case, I will close this one as 'Won't fix'. Feel free to reopen it if anyone has concerns.

          Show
          yufeigu Yufei Gu added a comment - We don't sort running apps anymore according to YARN-3547 , and the code invoking the TimSort() has been removed. For that case, I will close this one as 'Won't fix'. Feel free to reopen it if anyone has concerns.
          Hide
          imstefanlee stefanlee added a comment -

          Thanks Yufei Gu ,i also met this problem,my scenario is that i decommission all my cluster nodemangers,after that ,thread "continuous scheduling" was down,then i found this exception was happened in "Collections.sort" in thread "continous scheduling" .

          Show
          imstefanlee stefanlee added a comment - Thanks Yufei Gu ,i also met this problem,my scenario is that i decommission all my cluster nodemangers,after that ,thread "continuous scheduling" was down,then i found this exception was happened in "Collections.sort" in thread "continous scheduling" .
          Hide
          yufeigu Yufei Gu added a comment - - edited

          Hi stefanlee, continous scheduling uses the same code, I guess you got the similar stack trace. Please check your hadoop version to see if it has YARN-3547.

          Show
          yufeigu Yufei Gu added a comment - - edited Hi stefanlee , continous scheduling uses the same code, I guess you got the similar stack trace. Please check your hadoop version to see if it has YARN-3547 .
          Hide
          imstefanlee stefanlee added a comment -

          Thanks ,my hadoop version is 2.4.0 and i just found that continuousScheduling thread has removed collections.sort in hadoop-3.0.0, i will review the code carefully.

          Show
          imstefanlee stefanlee added a comment - Thanks ,my hadoop version is 2.4.0 and i just found that continuousScheduling thread has removed collections.sort in hadoop-3.0.0, i will review the code carefully.
          Hide
          gzh1992n Zephyr Guo added a comment - - edited

          Sorry for some reason did not deal with this issue for a long time. Now pick it up, we have found the bug.
          Reopen the issue.

          Show
          gzh1992n Zephyr Guo added a comment - - edited Sorry for some reason did not deal with this issue for a long time. Now pick it up, we have found the bug. Reopen the issue.
          Hide
          gzh1992n Zephyr Guo added a comment - - edited

          The bug cause by NaN.

          I wrote a test case to verify FairShareComparator(see patch), and then found that the FairShareComparator can not deal with weights=0 correctly. We dump the collection(see timsort.log) that broke sorting from our cluster to confirm whether it is 0. The weight should be greater than or equal to 1(I think). In fact, weight would be 0.

          We get NaN when memorySize=0 and weight=0.

          useToWeightRatio1 = s1.getResourceUsage().getMemorySize() /
            s1.getWeights().getWeight(ResourceType.MEMORY)
          

          I'm not sure whether this is a bug for weight.We can talk about this in another issue.
          If weight = 0 , the demand memory must be 0 and yarn.scheduler.fair.sizebasedweight is enable.
          Formula: weight = log2(1 + demand)

          it seems that a meaningful weight must be greater than or equal to 1. So I just fix weight to 1 in patch. Anyway we need more strict code.

          BTW:
          I think there are still problems related to concurrency(Like the description says that).
          If you enable yarn.resourcemanager.work-preserving-recovery.enabled(default is false in my version), recoverContainer method would be invoked in another thread. The method can modify attemptResourceUsage. This will go wrong when you are sorting FSAppAttempt.
          If as I think, we may need to create a new issue to fix this.

          Show
          gzh1992n Zephyr Guo added a comment - - edited The bug cause by NaN. I wrote a test case to verify FairShareComparator (see patch), and then found that the FairShareComparator can not deal with weights=0 correctly. We dump the collection(see timsort.log) that broke sorting from our cluster to confirm whether it is 0. The weight should be greater than or equal to 1(I think). In fact, weight would be 0. We get NaN when memorySize=0 and weight=0. useToWeightRatio1 = s1.getResourceUsage().getMemorySize() / s1.getWeights().getWeight(ResourceType.MEMORY) I'm not sure whether this is a bug for weight.We can talk about this in another issue. If weight = 0 , the demand memory must be 0 and yarn.scheduler.fair.sizebasedweight is enable. Formula: weight = log2(1 + demand) it seems that a meaningful weight must be greater than or equal to 1. So I just fix weight to 1 in patch. Anyway we need more strict code. BTW: I think there are still problems related to concurrency(Like the description says that). If you enable yarn.resourcemanager.work-preserving-recovery.enabled (default is false in my version), recoverContainer method would be invoked in another thread. The method can modify attemptResourceUsage . This will go wrong when you are sorting FSAppAttempt . If as I think, we may need to create a new issue to fix this.
          Hide
          gzh1992n Zephyr Guo added a comment -

          Why demand is 0? Because updateDemand is invoked after adding FSAppAttempt to runnableApps list.

          Show
          gzh1992n Zephyr Guo added a comment - Why demand is 0? Because updateDemand is invoked after adding FSAppAttempt to runnableApps list.
          Hide
          gzh1992n Zephyr Guo added a comment - - edited

          I don't recommed setting java.util.Arrays.useLegacyMergeSort=true.The way just avoid getting RuntimeException, but can not avoid getting a wrong order of sorting result. In the trunk version, although we use a tree structure to sort, but can't avoid getting a wrong result either.

          Show
          gzh1992n Zephyr Guo added a comment - - edited I don't recommed setting java.util.Arrays.useLegacyMergeSort=true .The way just avoid getting RuntimeException, but can not avoid getting a wrong order of sorting result. In the trunk version, although we use a tree structure to sort, but can't avoid getting a wrong result either.
          Hide
          imstefanlee stefanlee added a comment - - edited

          thanks Zephyr Guo ,the patch you fixed is that the exception happended in FairSharePolicy,but my scenario is that Collections.sort(nodeIdList, nodeAvailableResourceComparator) throws exception when decommission nodemanagers, nodeAvailableResourceComparator only compares node available memory.

          Show
          imstefanlee stefanlee added a comment - - edited thanks Zephyr Guo ,the patch you fixed is that the exception happended in FairSharePolicy ,but my scenario is that Collections.sort(nodeIdList, nodeAvailableResourceComparator) throws exception when decommission nodemanagers, nodeAvailableResourceComparator only compares node available memory.
          Hide
          yufeigu Yufei Gu added a comment - - edited

          Hi Zephyr Guo, thanks for working on this. Some thoughts about the patch. Both 0.5(weight value less than 1.0) or 0.0 are valid value for weights in fair scheduler. Once use case of zero-weight would be that user uses the zero-weight queue to run jobs when there is no jobs for other non-zero-weight queues. So it make no sense to me to enforce weight larger than 1.0.
          If NaN affects the transitive, we can avoid NaN by other ways. For example, if the first weight is 0.0 and the second is bigger than 0.0, obviously, the second one is needier than the first one.

          Show
          yufeigu Yufei Gu added a comment - - edited Hi Zephyr Guo , thanks for working on this. Some thoughts about the patch. Both 0.5(weight value less than 1.0) or 0.0 are valid value for weights in fair scheduler. Once use case of zero-weight would be that user uses the zero-weight queue to run jobs when there is no jobs for other non-zero-weight queues. So it make no sense to me to enforce weight larger than 1.0. If NaN affects the transitive, we can avoid NaN by other ways. For example, if the first weight is 0.0 and the second is bigger than 0.0, obviously, the second one is needier than the first one.
          Hide
          gzh1992n Zephyr Guo added a comment -

          Thanks for you comment.I will submit a new patch in recent days.

          Show
          gzh1992n Zephyr Guo added a comment - Thanks for you comment.I will submit a new patch in recent days.
          Hide
          Naganarasimha Naganarasimha G R added a comment -

          Added to contributor's list and assigned to Zephyr Guo

          Show
          Naganarasimha Naganarasimha G R added a comment - Added to contributor's list and assigned to Zephyr Guo
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 50s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 20s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 54s trunk passed
          +1 javadoc 0m 20s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          -1 checkstyle 0m 17s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 11 new + 8 unchanged - 0 fixed = 19 total (was 8)
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 38m 33s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          54m 58s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12834084/YARN-4743-v2.patch
          JIRA Issue YARN-4743
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 6e071942a91f 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4bca385
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13436/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13436/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13436/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 50s trunk passed +1 compile 0m 32s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 54s trunk passed +1 javadoc 0m 20s trunk passed +1 mvninstall 0m 31s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed -1 checkstyle 0m 17s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 11 new + 8 unchanged - 0 fixed = 19 total (was 8) +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 1s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 38m 33s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 54m 58s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12834084/YARN-4743-v2.patch JIRA Issue YARN-4743 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 6e071942a91f 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 4bca385 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13436/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13436/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13436/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          gzh1992n Zephyr Guo added a comment -

          Yufei Gu Could you review patch v2. Does that look ok?

          Show
          gzh1992n Zephyr Guo added a comment - Yufei Gu Could you review patch v2. Does that look ok?
          Hide
          yufeigu Yufei Gu added a comment -

          Hi Zephyr Guo, thanks for working on this.
          The patch v2 looks generally good to me. Some nits:
          1. If you want to use if-else statements, better to use weight1 == 0 instead of weight1 != 0 to get better readability. Or we can use this to avoid if-else statements

          useToWeightRatio1 = -weight1;
          useToWeightRatio2 = -weight2;
          

          2. Please describe the change in doc of function FairShareComparator.
          3. Please fixed all style issue in Hadoop QA's comment.
          4. Can we put the TestFairShareComparator into TestSchedulingPolicy, and add doc for the function in the unit test?
          5. Not sure why startTimeColloection and nameCollection are needed. Can you explain a little bit?

          Show
          yufeigu Yufei Gu added a comment - Hi Zephyr Guo , thanks for working on this. The patch v2 looks generally good to me. Some nits: 1. If you want to use if-else statements, better to use weight1 == 0 instead of weight1 != 0 to get better readability. Or we can use this to avoid if-else statements useToWeightRatio1 = -weight1; useToWeightRatio2 = -weight2; 2. Please describe the change in doc of function FairShareComparator . 3. Please fixed all style issue in Hadoop QA's comment. 4. Can we put the TestFairShareComparator into TestSchedulingPolicy , and add doc for the function in the unit test? 5. Not sure why startTimeColloection and nameCollection are needed. Can you explain a little bit?
          Hide
          gzh1992n Zephyr Guo added a comment -

          Yufei Gu, thanks for reviewing.

          5. Not sure why startTimeColloection and nameCollection are needed. Can you explain a little bit?

          Because some pieces of codes involve these two variables.

          FairShareComparator
                if (res == 0) {
                  // Apps are tied in fairness ratio. Break the tie by submit time and job
                  // name to get a deterministic ordering, which is useful for unit tests.
                  res = (int) Math.signum(s1.getStartTime() - s2.getStartTime());
                  if (res == 0)
                    res = s1.getName().compareTo(s2.getName());
                }
          

          I will submit a new patch this week.

          Show
          gzh1992n Zephyr Guo added a comment - Yufei Gu , thanks for reviewing. 5. Not sure why startTimeColloection and nameCollection are needed. Can you explain a little bit? Because some pieces of codes involve these two variables. FairShareComparator if (res == 0) { // Apps are tied in fairness ratio. Break the tie by submit time and job // name to get a deterministic ordering, which is useful for unit tests. res = ( int ) Math .signum(s1.getStartTime() - s2.getStartTime()); if (res == 0) res = s1.getName().compareTo(s2.getName()); } I will submit a new patch this week.
          Hide
          gzh1992n Zephyr Guo added a comment -

          Patch v3:
          1.Add doc
          2.Put TestFairShareComparator into TestSchedulingPolicy

          Show
          gzh1992n Zephyr Guo added a comment - Patch v3: 1.Add doc 2.Put TestFairShareComparator into TestSchedulingPolicy
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 37s trunk passed
          +1 compile 0m 31s trunk passed
          +1 checkstyle 0m 20s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 0m 55s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12)
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 6s the patch passed
          +1 javadoc 0m 17s the patch passed
          -1 unit 35m 50s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          50m 3s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835106/YARN-4743-v3.patch
          JIRA Issue YARN-4743
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux be86ad9498f5 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / dbd2057
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13500/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13500/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 14s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 37s trunk passed +1 compile 0m 31s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 16s trunk passed +1 findbugs 0m 55s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 31s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12) +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 17s the patch passed -1 unit 35m 50s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 50m 3s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835106/YARN-4743-v3.patch JIRA Issue YARN-4743 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux be86ad9498f5 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / dbd2057 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13500/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13500/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13500/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 45s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 20s trunk passed
          +1 mvnsite 0m 37s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 58s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 28s the patch passed
          +1 javac 0m 28s the patch passed
          +1 checkstyle 0m 18s the patch passed
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 0s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 39m 30s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          53m 55s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835157/YARN-4743-v4.patch
          JIRA Issue YARN-4743
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f2990e7beca8 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / dbd2057
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13503/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13503/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 18s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 45s trunk passed +1 compile 0m 32s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 37s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 58s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 31s the patch passed +1 compile 0m 28s the patch passed +1 javac 0m 28s the patch passed +1 checkstyle 0m 18s the patch passed +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 0s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 39m 30s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 53m 55s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835157/YARN-4743-v4.patch JIRA Issue YARN-4743 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f2990e7beca8 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / dbd2057 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13503/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13503/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          yufeigu Yufei Gu added a comment -

          Zephyr Guo, the v4 looks really good. A minor nit: can you describe how to deal with zero-weight in document of FairShareComparator? Something like "give slots to the non-zero weight one if the other one's weight is zero". Thanks!

          Show
          yufeigu Yufei Gu added a comment - Zephyr Guo , the v4 looks really good. A minor nit: can you describe how to deal with zero-weight in document of FairShareComparator? Something like "give slots to the non-zero weight one if the other one's weight is zero". Thanks!
          Hide
          yufeigu Yufei Gu added a comment - - edited

          Zephyr Guo, since we're doing batch review, I upload patch v5 for you including some changes in doc. So Karthik Kambatla can review and commit.

          Show
          yufeigu Yufei Gu added a comment - - edited Zephyr Guo , since we're doing batch review, I upload patch v5 for you including some changes in doc. So Karthik Kambatla can review and commit.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 3s trunk passed
          +1 compile 0m 34s trunk passed
          +1 checkstyle 0m 20s trunk passed
          +1 mvnsite 0m 39s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 0s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 checkstyle 0m 19s the patch passed
          +1 mvnsite 0m 37s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 5s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 35m 54s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          51m 35s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue YARN-4743
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835683/YARN-4743-v5.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 21ab7a0a2f6c 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 022bf78
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13598/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13598/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 13s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 7m 3s trunk passed +1 compile 0m 34s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 39s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 1m 0s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 31s the patch passed +1 javac 0m 31s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 0m 37s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 5s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 35m 54s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 51m 35s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Issue YARN-4743 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835683/YARN-4743-v5.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 21ab7a0a2f6c 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 022bf78 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13598/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13598/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          kasha Karthik Kambatla added a comment -

          +1

          Checking this in..

          Show
          kasha Karthik Kambatla added a comment - +1 Checking this in..
          Hide
          kasha Karthik Kambatla added a comment -

          Thanks Zephyr Guo for the contribution, Yufei Gu for the reviews and updating the patch.

          Just committed this to trunk and branch-2.

          Show
          kasha Karthik Kambatla added a comment - Thanks Zephyr Guo for the contribution, Yufei Gu for the reviews and updating the patch. Just committed this to trunk and branch-2.
          Hide
          gzh1992n Zephyr Guo added a comment -

          Thanks Yufei Gu, Karthik Kambatla for reviewing!

          Show
          gzh1992n Zephyr Guo added a comment - Thanks Yufei Gu , Karthik Kambatla for reviewing!
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10715 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10715/)
          YARN-4743. FairSharePolicy breaks TimSort assumption. (Zephyr Guo and (kasha: rev 4df8ed63ed93f2542e4b48f521b0cc6624ab59c1)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestSchedulingPolicy.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10715 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10715/ ) YARN-4743 . FairSharePolicy breaks TimSort assumption. (Zephyr Guo and (kasha: rev 4df8ed63ed93f2542e4b48f521b0cc6624ab59c1) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestSchedulingPolicy.java

            People

            • Assignee:
              gzh1992n Zephyr Guo
              Reporter:
              gzh1992n Zephyr Guo
            • Votes:
              1 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development