Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-5333

Some recovered apps are put into default queue when RM HA

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 2.7.4, 3.0.0-alpha1, 2.8.2
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Enable RM HA and use FairScheduler, yarn.scheduler.fair.allow-undeclared-pools is set to false, yarn.scheduler.fair.user-as-default-queue is set to false.

      Reproduce steps:
      1. Start two RMs.
      2. After RMs are running, change both RM's file etc/hadoop/fair-scheduler.xml, then add some queues.
      3. Submit some apps to the new added queues.
      4. Stop the active RM, then the standby RM will transit to active and recover apps.
      However the new active RM will put recovered apps into default queue because it might have not loaded the new fair-scheduler.xml. We need call initScheduler before start active services or bring refreshAll() in front of rm.transitionToActive(). It seems it is also important for other scheduler.

      1. YARN-5333.01.patch
        1 kB
        Jun Gong
      2. YARN-5333.02.patch
        7 kB
        Jun Gong
      3. YARN-5333.03.patch
        9 kB
        Jun Gong
      4. YARN-5333.04.patch
        11 kB
        Jun Gong
      5. YARN-5333.05.patch
        11 kB
        Jun Gong
      6. YARN-5333.06.patch
        17 kB
        Jun Gong
      7. YARN-5333.07.patch
        18 kB
        Jun Gong
      8. YARN-5333.08.patch
        18 kB
        Jun Gong
      9. YARN-5333.09.patch
        16 kB
        Jun Gong
      10. YARN-5333.10.patch
        15 kB
        Jun Gong

        Issue Links

          Activity

          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          However the new active RM will reject recovered apps because it might have not loaded the new fair-scheduler.xml. We need call initScheduler before start active services or bring refreshAll() in front of rm.transitionToActive(). It seems it is aslo important for other scheduler.

          Unfortunately, today, unless you are using the not-commonly-used FileSystemBasedConfigurationProvider stuff, it is expected that the admin update the xml files on both the RMs and issue a refreshQueues on both the RMs. I understand that in the fair-scheduler's case, there is an auto-refresh - this means FairScheduler needs special handling to auto-refresh on standby RM becoming active.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - However the new active RM will reject recovered apps because it might have not loaded the new fair-scheduler.xml. We need call initScheduler before start active services or bring refreshAll() in front of rm.transitionToActive(). It seems it is aslo important for other scheduler. Unfortunately, today, unless you are using the not-commonly-used FileSystemBasedConfigurationProvider stuff, it is expected that the admin update the xml files on both the RMs and issue a refreshQueues on both the RMs. I understand that in the fair-scheduler's case, there is an auto-refresh - this means FairScheduler needs special handling to auto-refresh on standby RM becoming active.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Vinod Kumar Vavilapalli for the comments.

          I checked the code, IIUC fair-scheduler.xml could be only loaded as a local file. And it will cause a StandbyException to issue a refreshQueues on a standby RM, so the problem still exists. I think we need call initScheduler(to refresh queues) when standby RM becomes active or remove the check for ACTIVE state for refreshQueues.

          Show
          hex108 Jun Gong added a comment - Thanks Vinod Kumar Vavilapalli for the comments. I checked the code, IIUC fair-scheduler.xml could be only loaded as a local file. And it will cause a StandbyException to issue a refreshQueues on a standby RM, so the problem still exists. I think we need call initScheduler (to refresh queues) when standby RM becomes active or remove the check for ACTIVE state for refreshQueues .
          Hide
          hex108 Jun Gong added a comment -

          Sorry for my mistakes:
          1. We changed some code in our code, so that apps will be rejected if the queue does not exist. For the trunk code, recovered apps will be put into 'default' queue if their queue does not exist. I think we still need fix it.
          2. It will only refresh active RM's queues when running rmadmin -refreshQueues on both active and standby RM.

          Attach a patch to fix the problem that mentioned above. I verify it on my local cluster and it works.

          Show
          hex108 Jun Gong added a comment - Sorry for my mistakes: 1. We changed some code in our code, so that apps will be rejected if the queue does not exist. For the trunk code, recovered apps will be put into 'default' queue if their queue does not exist. I think we still need fix it. 2. It will only refresh active RM's queues when running rmadmin -refreshQueues on both active and standby RM. Attach a patch to fix the problem that mentioned above. I verify it on my local cluster and it works.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 35s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 7m 19s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 1s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 34s the patch passed
          +1 compile 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 checkstyle 0m 19s the patch passed
          +1 mvnsite 0m 38s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 6s the patch passed
          +1 javadoc 0m 20s the patch passed
          -1 unit 37m 37s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          53m 22s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817453/YARN-5333.01.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 8fcc99471cec 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 7705812
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12287/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12287/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12287/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12287/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 35s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 7m 19s trunk passed +1 compile 0m 33s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 19s trunk passed +1 findbugs 1m 1s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 34s the patch passed +1 compile 0m 31s the patch passed +1 javac 0m 31s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 0m 38s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 20s the patch passed -1 unit 37m 37s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 16s The patch does not generate ASF License warnings. 53m 22s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817453/YARN-5333.01.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 8fcc99471cec 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 7705812 Default Java 1.8.0_91 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/12287/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12287/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12287/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12287/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Add a test case in the new patch to reproduce the problem.

          Show
          hex108 Jun Gong added a comment - Add a test case in the new patch to reproduce the problem.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 40s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 17s trunk passed
          +1 compile 0m 36s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 42s trunk passed
          +1 mvneclipse 0m 20s trunk passed
          +1 findbugs 1m 11s trunk passed
          +1 javadoc 0m 25s trunk passed
          +1 mvninstall 0m 38s the patch passed
          +1 compile 0m 39s the patch passed
          +1 javac 0m 39s the patch passed
          +1 checkstyle 0m 25s the patch passed
          +1 mvnsite 0m 45s the patch passed
          +1 mvneclipse 0m 19s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 16s the patch passed
          +1 javadoc 0m 23s the patch passed
          -1 unit 42m 8s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          59m 11s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
            hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817678/YARN-5333.02.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux fc5c75cacfbd 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d6d41e8
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12310/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12310/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12310/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 40s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 7m 17s trunk passed +1 compile 0m 36s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 42s trunk passed +1 mvneclipse 0m 20s trunk passed +1 findbugs 1m 11s trunk passed +1 javadoc 0m 25s trunk passed +1 mvninstall 0m 38s the patch passed +1 compile 0m 39s the patch passed +1 javac 0m 39s the patch passed +1 checkstyle 0m 25s the patch passed +1 mvnsite 0m 45s the patch passed +1 mvneclipse 0m 19s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 16s the patch passed +1 javadoc 0m 23s the patch passed -1 unit 42m 8s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 20s The patch does not generate ASF License warnings. 59m 11s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart   hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817678/YARN-5333.02.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux fc5c75cacfbd 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / d6d41e8 Default Java 1.8.0_91 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/12310/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12310/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12310/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12310/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          The reason for test case errors in TestRMWebServicesAppsModification(e.g. testAppMove) is that they reinitialize CapacityScheduler with a new CapacitySchedulerConfiguration before rm.start() and it will cause problems to reinitialize it two times. However from another point of view, I think CapacityScheduler also needs this patch. Vinod Kumar Vavilapalli, Varun Vasudev could you please help confirm it? Thanks!

          Show
          hex108 Jun Gong added a comment - The reason for test case errors in TestRMWebServicesAppsModification(e.g. testAppMove) is that they reinitialize CapacityScheduler with a new CapacitySchedulerConfiguration before rm.start() and it will cause problems to reinitialize it two times. However from another point of view, I think CapacityScheduler also needs this patch. Vinod Kumar Vavilapalli , Varun Vasudev could you please help confirm it? Thanks!
          Hide
          hex108 Jun Gong added a comment -

          I verified it for CapacityScheduler:
          1. Without the patch, apps that submitted to new added queues will be killed, the diagnostics message is "Application killed on recovery as it was submitted to queue c which no longer exists after restart.".
          2. With the patch, apps will be recovered normally.

          Show
          hex108 Jun Gong added a comment - I verified it for CapacityScheduler: 1. Without the patch, apps that submitted to new added queues will be killed, the diagnostics message is "Application killed on recovery as it was submitted to queue c which no longer exists after restart.". 2. With the patch, apps will be recovered normally.
          Hide
          hex108 Jun Gong added a comment -

          Attach a new patch 03.patch to fix the test case error.

          Could someone please help review it? Thanks!

          Show
          hex108 Jun Gong added a comment - Attach a new patch 03.patch to fix the test case error. Could someone please help review it? Thanks!
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 20s trunk passed
          +1 compile 0m 37s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 44s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 7s trunk passed
          +1 javadoc 0m 23s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 0m 35s the patch passed
          +1 javac 0m 35s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 41s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 13s the patch passed
          +1 javadoc 0m 21s the patch passed
          +1 unit 34m 46s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          52m 3s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819114/YARN-5333.03.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux c8f8c2718e05 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / e340064
          Default Java 1.8.0_91
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12428/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12428/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 17s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 8m 20s trunk passed +1 compile 0m 37s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 44s trunk passed +1 mvneclipse 0m 19s trunk passed +1 findbugs 1m 7s trunk passed +1 javadoc 0m 23s trunk passed +1 mvninstall 0m 37s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 41s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 13s the patch passed +1 javadoc 0m 21s the patch passed +1 unit 34m 46s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 52m 3s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819114/YARN-5333.03.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux c8f8c2718e05 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / e340064 Default Java 1.8.0_91 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12428/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12428/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 27s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 7m 7s trunk passed
          +1 compile 0m 34s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 46s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 7s trunk passed
          +1 javadoc 0m 25s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 0m 34s the patch passed
          +1 javac 0m 34s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 39s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 15s the patch passed
          +1 javadoc 0m 21s the patch passed
          +1 unit 33m 45s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          49m 57s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819114/YARN-5333.03.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b2ace206f17c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 521f343
          Default Java 1.8.0_91
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12432/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12432/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 27s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 7m 7s trunk passed +1 compile 0m 34s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 46s trunk passed +1 mvneclipse 0m 19s trunk passed +1 findbugs 1m 7s trunk passed +1 javadoc 0m 25s trunk passed +1 mvninstall 0m 37s the patch passed +1 compile 0m 34s the patch passed +1 javac 0m 34s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 39s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 15s the patch passed +1 javadoc 0m 21s the patch passed +1 unit 33m 45s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 49m 57s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819114/YARN-5333.03.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux b2ace206f17c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 521f343 Default Java 1.8.0_91 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12432/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12432/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          sunilg Sunil G added a comment -

          Hi Jun Gong
          Thanks for working on this patch. I have few doubts on the test setup of your while testing with CS.

          Without the patch, apps that submitted to new added queues will be killed, the diagnostics message is "Application killed on recovery as it was submitted to queue c which no longer exists after restart.".

          While you added a new queue, have you performed "yarn rmadmin refreshQueues" command. This is to ensure the changed queue topology is refreshed. Note: CS doesnt have an auto refresh like Fair. Also if you were not using something like FileSystemBasedConfigurationProvider, i think you have update the same change configuration for queue change in both nodes. At this point of time, if you do any HA, you wont be getting this issue.
          Could you please help to confirm once. and pls correct me if I missed some steps which you may have done.

          Show
          sunilg Sunil G added a comment - Hi Jun Gong Thanks for working on this patch. I have few doubts on the test setup of your while testing with CS. Without the patch, apps that submitted to new added queues will be killed, the diagnostics message is "Application killed on recovery as it was submitted to queue c which no longer exists after restart.". While you added a new queue, have you performed "yarn rmadmin refreshQueues" command. This is to ensure the changed queue topology is refreshed. Note : CS doesnt have an auto refresh like Fair. Also if you were not using something like FileSystemBasedConfigurationProvider , i think you have update the same change configuration for queue change in both nodes. At this point of time, if you do any HA, you wont be getting this issue. Could you please help to confirm once. and pls correct me if I missed some steps which you may have done.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Sunil G for review and comments.

          I tested with normal config file etc/hadoop/capacity-scheduler.xml, update it on the two RMs and run yarn rmadmin refreshQueues on both RMs. IIUC the command yarn rmadmin refreshQueues only takes effect on active RM, checkRMStatus in the following code will check whether it is a active RM.

          @Override
            public RefreshQueuesResponse refreshQueues(RefreshQueuesRequest request)
                throws YarnException, StandbyException {
              String argName = "refreshQueues";
              final String msg = "refresh queues.";
              UserGroupInformation user = checkAcls(argName);
          
              checkRMStatus(user.getShortUserName(), argName, msg);
              
              ...
          }
          
          Show
          hex108 Jun Gong added a comment - Thanks Sunil G for review and comments. I tested with normal config file etc/hadoop/capacity-scheduler.xml , update it on the two RMs and run yarn rmadmin refreshQueues on both RMs. IIUC the command yarn rmadmin refreshQueues only takes effect on active RM, checkRMStatus in the following code will check whether it is a active RM. @Override public RefreshQueuesResponse refreshQueues(RefreshQueuesRequest request) throws YarnException, StandbyException { String argName = "refreshQueues" ; final String msg = "refresh queues." ; UserGroupInformation user = checkAcls(argName); checkRMStatus(user.getShortUserName(), argName, msg); ... }
          Hide
          sunilg Sunil G added a comment - - edited

          HI Jun Gong,
          Thanks for pointing out regarding refreshQueues failure on Standby node.
          Could you also please confirm that whether you have added new queue manually in capacity-scheduler.xml of Standby node, and test the same scenario.

          Because the current approach in your patch will induce a new problem. Suppose if capacity-scheduler.xml is corrupted, then we will say a case where bth RMs will toggle to become active. We had discussed this solutions in another HA ticket and has thought about not trying to do any refresh until active services are started.

          Show
          sunilg Sunil G added a comment - - edited HI Jun Gong , Thanks for pointing out regarding refreshQueues failure on Standby node. Could you also please confirm that whether you have added new queue manually in capacity-scheduler.xml of Standby node, and test the same scenario. Because the current approach in your patch will induce a new problem. Suppose if capacity-scheduler.xml is corrupted, then we will say a case where bth RMs will toggle to become active. We had discussed this solutions in another HA ticket and has thought about not trying to do any refresh until active services are started.
          Hide
          hex108 Jun Gong added a comment -

          Could you also please confirm that whether you have added new queue manually in capacity-scheduler.xml of Standby node, and test the same scenario.

          I copy the capacity-scheduler.xml from active RM to standby RM, then they are same on both RMs. Yes, I tested the same scenario.

          Because the current approach in your patch will induce a new problem. Suppose if capacity-scheduler.xml is corrupted, then we will say a case where bth RMs will toggle to become active. We had discussed this solutions in another HA ticket and has thought about not trying to do any refresh until active services are started.

          If if capacity-scheduler.xml was corrupted, I saw RM crashed when RM HA because it failed to validateConf(CapacityScheduler.validateConf)(Note: when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail and not cause RM to crash). Is there something I missed?

          Show
          hex108 Jun Gong added a comment - Could you also please confirm that whether you have added new queue manually in capacity-scheduler.xml of Standby node, and test the same scenario. I copy the capacity-scheduler.xml from active RM to standby RM, then they are same on both RMs. Yes, I tested the same scenario. Because the current approach in your patch will induce a new problem. Suppose if capacity-scheduler.xml is corrupted, then we will say a case where bth RMs will toggle to become active. We had discussed this solutions in another HA ticket and has thought about not trying to do any refresh until active services are started. If if capacity-scheduler.xml was corrupted, I saw RM crashed when RM HA because it failed to validateConf( CapacityScheduler.validateConf )(Note: when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail and not cause RM to crash). Is there something I missed?
          Hide
          sunilg Sunil G added a comment - - edited

          Thanks Jun Gong
          Yes, we are recovering apps (by calling startActiveServices) first and then only trying to do refreshQueues from AdminService#transitionToActive. So apps on newly added queue will fail during recovery.

          when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail

          As per your patch if refreshQueues raise an exception may be due to a corrupted conf file, then we can see RMs will toggle. In YARN-3893, they placed all refresh calls after transitionToActive call to handle this case. During that time, I made the similar suggestion (I suggested refreshAll) as given in this patch now. Pls refer my comment. Rohith Sharma K S helped to point out a possible problem with this approach.

          I agree that its a pblm in CS given we are using normal conf file. So If we could handle the exception from refreshQeueues which can be called prior to rm.transitionToActive() and do fail fast directly, then we can somehow manage both issues. Rohith Sharma K S, Jian He Thoughts?

          Show
          sunilg Sunil G added a comment - - edited Thanks Jun Gong Yes, we are recovering apps (by calling startActiveServices) first and then only trying to do refreshQueues from AdminService#transitionToActive . So apps on newly added queue will fail during recovery. when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail As per your patch if refreshQueues raise an exception may be due to a corrupted conf file, then we can see RMs will toggle. In YARN-3893 , they placed all refresh calls after transitionToActive call to handle this case. During that time, I made the similar suggestion (I suggested refreshAll) as given in this patch now. Pls refer my comment . Rohith Sharma K S helped to point out a possible problem with this approach. I agree that its a pblm in CS given we are using normal conf file. So If we could handle the exception from refreshQeueues which can be called prior to rm.transitionToActive() and do fail fast directly , then we can somehow manage both issues. Rohith Sharma K S , Jian He Thoughts?
          Hide
          hex108 Jun Gong added a comment -

          refreshQueues will cause StandbyException, however rmContext.getScheduler().reinitialize() in the patch will not cause the StandbyException. So I think the patch will not introduce the problem mentioned in YARN-3893. If there is an exception caused by rmContext.getScheduler().reinitialize(), RM will transit to standby state.

          Show
          hex108 Jun Gong added a comment - refreshQueues will cause StandbyException, however rmContext.getScheduler().reinitialize() in the patch will not cause the StandbyException. So I think the patch will not introduce the problem mentioned in YARN-3893 . If there is an exception caused by rmContext.getScheduler().reinitialize() , RM will transit to standby state.
          Hide
          sunilg Sunil G added a comment -

          Jun Gong, thanks for the clarification. With YARN-3893, we were trying to fail-fast RM if wrong capacity-scheduler is present. With the current patch,

                   try {
          +          reinitializeActiveServices();
                     startActiveServices();
                     return null;
                   } catch (Exception e) {
          

          any exception during queue reinitialize will not make RM fail-fast. So I think you can have reinitializeActiveServices in another try block and invoke RM fail-fast with its exception handling block.
          However one more thing worries me. with this patch, reinitialize queue is done before starting the active services. So many service like nodelabel manager etc are not started (or dispatcher threads are not started). So if reinitialize has some event call flow, then such case may be a pblm. But as far as I checked, no such event handling is present in reinitialize call flow. Still I suggest to confirm once, I will also verify and will update if I find some leads.

          Show
          sunilg Sunil G added a comment - Jun Gong , thanks for the clarification. With YARN-3893 , we were trying to fail-fast RM if wrong capacity-scheduler is present. With the current patch, try { + reinitializeActiveServices(); startActiveServices(); return null ; } catch (Exception e) { any exception during queue reinitialize will not make RM fail-fast. So I think you can have reinitializeActiveServices in another try block and invoke RM fail-fast with its exception handling block. However one more thing worries me. with this patch, reinitialize queue is done before starting the active services. So many service like nodelabel manager etc are not started (or dispatcher threads are not started). So if reinitialize has some event call flow, then such case may be a pblm. But as far as I checked, no such event handling is present in reinitialize call flow. Still I suggest to confirm once, I will also verify and will update if I find some leads.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Sunil G.

          Yes, fail-fast seems better.

          However one more thing worries me. with this patch, reinitialize queue is done before starting the active services. Still I suggest to confirm once, I will also verify and will update if I find some leads.

          Thanks for it! I will check it too. If it is OK and without more comments, I will update the patch to address the 'fail-fast' problem.

          Show
          hex108 Jun Gong added a comment - Thanks Sunil G . Yes, fail-fast seems better. However one more thing worries me. with this patch, reinitialize queue is done before starting the active services. Still I suggest to confirm once, I will also verify and will update if I find some leads. Thanks for it! I will check it too. If it is OK and without more comments, I will update the patch to address the 'fail-fast' problem.
          Hide
          jianhe Jian He added a comment -

          If the initialization of active services is done before starting services, (currently done when transitioning to standby), such problem will not occur. Rohith Sharma K S/ Sunil G/Jun Gong, are these problems solvable ? I think switch time is fine. didn't quite get this:

          And RMWebApp has dependency on clienRMService for starting webapps. Without clientRMService initialization, RMWebapp can not be started.

          Show
          jianhe Jian He added a comment - If the initialization of active services is done before starting services, (currently done when transitioning to standby), such problem will not occur. Rohith Sharma K S / Sunil G / Jun Gong , are these problems solvable ? I think switch time is fine. didn't quite get this: And RMWebApp has dependency on clienRMService for starting webapps. Without clientRMService initialization, RMWebapp can not be started.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          Apologies for coming late!

          didn't quite get this:

          IIRC, starting RMWebApp, clientRMService instance being injected. If we do not initialize activeServices in standby then RMWebApp start up fails. This need dig more.

          Show
          rohithsharma Rohith Sharma K S added a comment - Apologies for coming late! didn't quite get this: IIRC, starting RMWebApp, clientRMService instance being injected. If we do not initialize activeServices in standby then RMWebApp start up fails. This need dig more.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          I think for fixing this issue, for any transistionToActive from admin service, refreshAll call first. Other things to consider is, skip RM state check only for transistionToActive nevertheless who calls either admin or elector service. It guaranty that transitioned RM is always started with new configurations. Thoughts?

          Show
          rohithsharma Rohith Sharma K S added a comment - I think for fixing this issue, for any transistionToActive from admin service, refreshAll call first. Other things to consider is, skip RM state check only for transistionToActive nevertheless who calls either admin or elector service. It guaranty that transitioned RM is always started with new configurations. Thoughts?
          Hide
          sunilg Sunil G added a comment -

          Rohith Sharma K S Jian He Jun Gong
          I think its better to refresh config OR init services before starting it. but we might have already inited it before pushing to standby earlier Hence we can only do a refresh alone. If refreshAll can do clean refresh without any state check, I think its good for now. This definitely comes with the cost of initing queue and will impact HA switch time a little. In my opinion, we can try improve/refactor refresh for various cases by considering whether to do HA state check or not. Will this be fine?

          Show
          sunilg Sunil G added a comment - Rohith Sharma K S Jian He Jun Gong I think its better to refresh config OR init services before starting it. but we might have already inited it before pushing to standby earlier Hence we can only do a refresh alone. If refreshAll can do clean refresh without any state check, I think its good for now. This definitely comes with the cost of initing queue and will impact HA switch time a little. In my opinion, we can try improve/refactor refresh for various cases by considering whether to do HA state check or not. Will this be fine?
          Hide
          jianhe Jian He added a comment -

          I prefer doing initialization of services before starting it. Then we don't need to init the services when transitioning to standby, also no need to call refreshAll. But because of the issue mentioned by Rohith, not sure that's doable. If that solution won't work out. I think doing what you suggested makes sense.

          Show
          jianhe Jian He added a comment - I prefer doing initialization of services before starting it. Then we don't need to init the services when transitioning to standby, also no need to call refreshAll. But because of the issue mentioned by Rohith, not sure that's doable. If that solution won't work out. I think doing what you suggested makes sense.
          Hide
          hex108 Jun Gong added a comment -

          Sorry for late reply. Thanks Rohith Sharma K S, Sunil G and Jian He's suggestion.

          IIRC, starting RMWebApp, clientRMService instance being injected. If we do not initialize activeServices in standby then RMWebApp start up fails. This need dig more.

          Do you mean ResourceManager#startWepApp will fail to start? It seems it starts at the beginning of RM start. How could I verify it? Thanks!

          I think it works to skip RM state check for the transitionToActive case. I could not figure out that why we need check RM status, in my opinion RM will not execute these refresh functions if it is in active state. Could you please explain it more?

          I prefer doing initialization of services before starting it. Then we don't need to init the services when transitioning to standby, also no need to call refreshAll.

          Just another thought: If the time spent by reinitialize does not matter a lot, how about adding initialization of services in the two places(at the beginning of transitionToActive and at the end of transtionToStandby)?

          Show
          hex108 Jun Gong added a comment - Sorry for late reply. Thanks Rohith Sharma K S , Sunil G and Jian He 's suggestion. IIRC, starting RMWebApp, clientRMService instance being injected. If we do not initialize activeServices in standby then RMWebApp start up fails. This need dig more. Do you mean ResourceManager#startWepApp will fail to start? It seems it starts at the beginning of RM start. How could I verify it? Thanks! I think it works to skip RM state check for the transitionToActive case. I could not figure out that why we need check RM status, in my opinion RM will not execute these refresh functions if it is in active state. Could you please explain it more? I prefer doing initialization of services before starting it. Then we don't need to init the services when transitioning to standby, also no need to call refreshAll. Just another thought: If the time spent by reinitialize does not matter a lot, how about adding initialization of services in the two places(at the beginning of transitionToActive and at the end of transtionToStandby)?
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          Do you mean ResourceManager#startWepApp will fail to start?

          IIRC, Yes, it used to fail because of the clientRMService injection into RMWebApp long time ago. Even I did not go deeper into investigate that time. I suggest that you can try as Jian suggested i.e createAndInitActiveServices during RM#transitionToActive.

          Show
          rohithsharma Rohith Sharma K S added a comment - Do you mean ResourceManager#startWepApp will fail to start? IIRC, Yes, it used to fail because of the clientRMService injection into RMWebApp long time ago. Even I did not go deeper into investigate that time. I suggest that you can try as Jian suggested i.e createAndInitActiveServices during RM#transitionToActive.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          I just tried modifying the code, the below error I was talking that RMWebApp start fails.

          com.google.inject.CreationException: Unable to create injector, see the following errors:
          
          1) Binding to null instances is not allowed. Use toProvider(Providers.of(null)) if this is your intended behaviour.
            at org.apache.hadoop.yarn.webapp.WebApps$Builder$2.configure(WebApps.java:335)
          
          1 error
          	at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:466)
          	at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:155)
          	at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:107)
          	at com.google.inject.Guice.createInjector(Guice.java:96)
          	at com.google.inject.Guice.createInjector(Guice.java:73)
          	at com.google.inject.Guice.createInjector(Guice.java:62)
          	at org.apache.hadoop.yarn.webapp.WebApps$Builder.build(WebApps.java:331)
          	at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:372)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1025)
          	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.startWepApp(MockRM.java:909)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1127)
          	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
          	at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testRMDispatcherForHA(TestRMHA.java:333)
          

          Apart from above, other point is RMWebService is started in StandBy RM where in REST calls can be made. Since if we do not initialize active services, then we could expect NPE from RMWebService. There are many more things to take care if we go for initializing active services during transitionToActive.

          Show
          rohithsharma Rohith Sharma K S added a comment - I just tried modifying the code, the below error I was talking that RMWebApp start fails. com.google.inject.CreationException: Unable to create injector, see the following errors: 1) Binding to null instances is not allowed. Use toProvider(Providers.of(null)) if this is your intended behaviour. at org.apache.hadoop.yarn.webapp.WebApps$Builder$2.configure(WebApps.java:335) 1 error at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:466) at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:155) at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:107) at com.google.inject.Guice.createInjector(Guice.java:96) at com.google.inject.Guice.createInjector(Guice.java:73) at com.google.inject.Guice.createInjector(Guice.java:62) at org.apache.hadoop.yarn.webapp.WebApps$Builder.build(WebApps.java:331) at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:372) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1025) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.startWepApp(MockRM.java:909) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1127) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testRMDispatcherForHA(TestRMHA.java:333) Apart from above, other point is RMWebService is started in StandBy RM where in REST calls can be made. Since if we do not initialize active services, then we could expect NPE from RMWebService. There are many more things to take care if we go for initializing active services during transitionToActive.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Rohith Sharma K S for verifying it and suggestion!

          I attached a new patch according to previous suggestions. Calling refreshAll before start active services, And those refresh functions will skip RM state check only for transistionToActive.

          Show
          hex108 Jun Gong added a comment - Thanks Rohith Sharma K S for verifying it and suggestion! I attached a new patch according to previous suggestions. Calling refreshAll before start active services, And those refresh functions will skip RM state check only for transistionToActive.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 38s trunk passed
          +1 compile 0m 41s trunk passed
          +1 checkstyle 0m 29s trunk passed
          +1 mvnsite 0m 53s trunk passed
          +1 mvneclipse 0m 21s trunk passed
          +1 findbugs 1m 16s trunk passed
          +1 javadoc 0m 28s trunk passed
          +1 mvninstall 0m 41s the patch passed
          +1 compile 0m 35s the patch passed
          +1 javac 0m 35s the patch passed
          -1 checkstyle 0m 24s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 238 unchanged - 0 fixed = 239 total (was 238)
          +1 mvnsite 0m 41s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 14s the patch passed
          +1 javadoc 0m 25s the patch passed
          -1 unit 39m 13s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          57m 44s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
            hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821328/YARN-5333.04.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 245628a0a868 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 95694b7
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12585/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12585/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 22s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 8m 38s trunk passed +1 compile 0m 41s trunk passed +1 checkstyle 0m 29s trunk passed +1 mvnsite 0m 53s trunk passed +1 mvneclipse 0m 21s trunk passed +1 findbugs 1m 16s trunk passed +1 javadoc 0m 28s trunk passed +1 mvninstall 0m 41s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed -1 checkstyle 0m 24s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 238 unchanged - 0 fixed = 239 total (was 238) +1 mvnsite 0m 41s the patch passed +1 mvneclipse 0m 17s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 14s the patch passed +1 javadoc 0m 25s the patch passed -1 unit 39m 13s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 57m 44s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart   hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821328/YARN-5333.04.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 245628a0a868 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 95694b7 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12585/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12585/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12585/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Attach a new patch to fix checkstyle error. Test cases error are not related, TestAMRestart is tracked in YARN-5043, the other might be solved by YARN-4312.

          Show
          hex108 Jun Gong added a comment - Attach a new patch to fix checkstyle error. Test cases error are not related, TestAMRestart is tracked in YARN-5043 , the other might be solved by YARN-4312 .
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 49s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 57s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          +1 checkstyle 0m 21s the patch passed
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 33m 12s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          47m 47s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821482/YARN-5333.05.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f72316163f8a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9f473cf
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12603/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12603/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 49s trunk passed +1 compile 0m 32s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 57s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 30s the patch passed +1 javac 0m 30s the patch passed +1 checkstyle 0m 21s the patch passed +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 1s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 33m 12s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 47m 47s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821482/YARN-5333.05.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f72316163f8a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 9f473cf Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12603/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12603/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          Thanks for the patch, some comments

          1. Should private boolean isTransitingToActive = false; is volatile?
          2. Since none of the refreshXXX methods are synchronized, patch introduces a concurrency issue. If there is an explicit admin call for refreshing at the time of transitionToActive, then checkRMStatus will be executed for other admin calls. Until RM transition-to-active completely, explicit admin commands should not allowed to refresh. I think, we should incorporate similar to refreshAdminAcl method.
          3. I think flag checkRMHAState can be passed to method checkRMStatus.

          Test:

          1. I think if you can simulate test for generally instead of specific to fair scheduler, this test can be moved to class TestRMHA. There is already test TestRMHA#testTransitionedToActiveRefreshFail, probable the same test can be changed?
          Show
          rohithsharma Rohith Sharma K S added a comment - Thanks for the patch, some comments Should private boolean isTransitingToActive = false; is volatile? Since none of the refreshXXX methods are synchronized, patch introduces a concurrency issue. If there is an explicit admin call for refreshing at the time of transitionToActive, then checkRMStatus will be executed for other admin calls. Until RM transition-to-active completely, explicit admin commands should not allowed to refresh. I think, we should incorporate similar to refreshAdminAcl method. I think flag checkRMHAState can be passed to method checkRMStatus . Test: I think if you can simulate test for generally instead of specific to fair scheduler, this test can be moved to class TestRMHA . There is already test TestRMHA#testTransitionedToActiveRefreshFail , probable the same test can be changed?
          Hide
          jianhe Jian He added a comment -

          Instead of reusing the existing refreshAll method, I checked each refresh method, it should be cleaner to just create a new method which includes all necessary reconfig steps. This also avoids unnecessary audit logs, acl checks. I feel trying to reuse exiting methods is making it worse in this case..

          Show
          jianhe Jian He added a comment - Instead of reusing the existing refreshAll method, I checked each refresh method, it should be cleaner to just create a new method which includes all necessary reconfig steps. This also avoids unnecessary audit logs, acl checks. I feel trying to reuse exiting methods is making it worse in this case..
          Hide
          hex108 Jun Gong added a comment - - edited

          Thanks Rohith Sharma K S, Jian He for the review and comments!

          1. Should private boolean isTransitingToActive = false; is volatile?

          Yes, it needs be volatile. I'll update it.

          2. Since none of the refreshXXX methods are synchronized, patch introduces a concurrency issue. If there is an explicit admin call for refreshing at the time of transitionToActive, then checkRMStatus will be executed for other admin calls. Until RM transition-to-active completely, explicit admin commands should not allowed to refresh. I think, we should incorporate similar to refreshAdminAcl method.

          How about adding synchronized to each refresh functions? It avoids adding more logic. When admin command comes, we could just call corresponding refresh functions. I think it does not matter to call refresh function many times.

          3. I think flag checkRMHAState can be passed to method checkRMStatus.

          I was thinking it. If adding checkRMHAState to checkRMStatus, we need add this parameter(checkRMHAState) to all refresh functions too(which is similar to refreshAdminAcl), there are a lot of places that call refresh functions. It might be better to just add a check before checkRMStatus?

          I think if you can simulate test for generally instead of specific to fair scheduler, this test can be moved to class TestRMHA. There is already test TestRMHA#testTransitionedToActiveRefreshFail, probable the same test can be changed?

          Thanks. I'll update the test case.

          Instead of reusing the existing refreshAll method, I checked each refresh method, it should be cleaner to just create a new method which includes all necessary reconfig steps. This also avoids unnecessary audit logs, acl checks.

          Yes, it will be more clear to add a new method to include all reconfig steps. My doubt is that there will be two places that do similar reconfig things(the one is in refresh functions, the other is in the new added method). Then we need to modify both places if there is some change for one of them. I will try to refactor those refresh functions.

          Show
          hex108 Jun Gong added a comment - - edited Thanks Rohith Sharma K S , Jian He for the review and comments! 1. Should private boolean isTransitingToActive = false; is volatile? Yes, it needs be volatile. I'll update it. 2. Since none of the refreshXXX methods are synchronized, patch introduces a concurrency issue. If there is an explicit admin call for refreshing at the time of transitionToActive, then checkRMStatus will be executed for other admin calls. Until RM transition-to-active completely, explicit admin commands should not allowed to refresh. I think, we should incorporate similar to refreshAdminAcl method. How about adding synchronized to each refresh functions? It avoids adding more logic. When admin command comes, we could just call corresponding refresh functions. I think it does not matter to call refresh function many times. 3. I think flag checkRMHAState can be passed to method checkRMStatus. I was thinking it. If adding checkRMHAState to checkRMStatus, we need add this parameter(checkRMHAState) to all refresh functions too(which is similar to refreshAdminAcl), there are a lot of places that call refresh functions. It might be better to just add a check before checkRMStatus? I think if you can simulate test for generally instead of specific to fair scheduler, this test can be moved to class TestRMHA. There is already test TestRMHA#testTransitionedToActiveRefreshFail, probable the same test can be changed? Thanks. I'll update the test case. Instead of reusing the existing refreshAll method, I checked each refresh method, it should be cleaner to just create a new method which includes all necessary reconfig steps. This also avoids unnecessary audit logs, acl checks. Yes, it will be more clear to add a new method to include all reconfig steps. My doubt is that there will be two places that do similar reconfig things(the one is in refresh functions, the other is in the new added method). Then we need to modify both places if there is some change for one of them. I will try to refactor those refresh functions.
          Hide
          hex108 Jun Gong added a comment -

          Attach a new patch.

          According to the suggestion, I abstracted refreshXXXWithout functions that do refresh without checking RM status.

          About the test case, it needs be bounded to a specific scheduler(either Capacity or FairScheduler) to reproduce the error case, so there is no change for it. Is it OK?

          Show
          hex108 Jun Gong added a comment - Attach a new patch. According to the suggestion, I abstracted refreshXXXWithout functions that do refresh without checking RM status. About the test case, it needs be bounded to a specific scheduler(either Capacity or FairScheduler) to reproduce the error case, so there is no change for it. Is it OK?
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 6m 43s trunk passed
          +1 compile 0m 31s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 37s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 55s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          -1 checkstyle 0m 21s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 254 unchanged - 0 fixed = 255 total (was 254)
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 37m 22s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          51m 50s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821629/YARN-5333.06.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 0eb4108d33ba 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 7fc70c6
          Default Java 1.8.0_101
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12612/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12612/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12612/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 19s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 6m 43s trunk passed +1 compile 0m 31s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 37s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 55s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 31s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed -1 checkstyle 0m 21s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 254 unchanged - 0 fixed = 255 total (was 254) +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 14s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 1s the patch passed +1 javadoc 0m 18s the patch passed +1 unit 37m 22s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 51m 50s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821629/YARN-5333.06.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 0eb4108d33ba 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 7fc70c6 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12612/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12612/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12612/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          jianhe Jian He added a comment -

          Not related to this patch, I noticed that, if anything fails in the refresh call, it sends the fatal event. I feel this a bit harsh. how about removing that ?

              } catch (Exception e) {
                LOG.error("RefreshAll failed so firing fatal event", e);
                rmContext
                    .getDispatcher()
                    .getEventHandler()
                    .handle(
                        new RMFatalEvent(RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED,
                            e));
                throw new ServiceFailedException(
                    "Error on refreshAll during transition to Active", e);
              }
          
          Show
          jianhe Jian He added a comment - Not related to this patch, I noticed that, if anything fails in the refresh call, it sends the fatal event. I feel this a bit harsh. how about removing that ? } catch (Exception e) { LOG.error( "RefreshAll failed so firing fatal event" , e); rmContext .getDispatcher() .getEventHandler() .handle( new RMFatalEvent(RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED, e)); throw new ServiceFailedException( "Error on refreshAll during transition to Active" , e); }
          Hide
          hex108 Jun Gong added a comment -

          Hi Jian He, I think the comment in YARN-3893 makes sense. How about it?

          If refreshAll() fails, BOTH RM will be in ACTIVE state as per this defect. Continuing RM services with invalid configuration does not good idea. Moreover invalid configurations should be notified to user immediately. So it would be better to make use of fail-fast configuration to exit the RM JVM. If this configuration is set to false , then call rm.handleTransitionToStandBy.

          Show
          hex108 Jun Gong added a comment - Hi Jian He , I think the comment in YARN-3893 makes sense. How about it? If refreshAll() fails, BOTH RM will be in ACTIVE state as per this defect. Continuing RM services with invalid configuration does not good idea. Moreover invalid configurations should be notified to user immediately. So it would be better to make use of fail-fast configuration to exit the RM JVM. If this configuration is set to false , then call rm.handleTransitionToStandBy.
          Hide
          jianhe Jian He added a comment -

          then call rm.handleTransitionToStandBy.

          I think the ActiveStandbyElector will handle reJoin automatically if exception thrown from the transtionToActive method.
          We can use the fail fast config for this scenario.

          Show
          jianhe Jian He added a comment - then call rm.handleTransitionToStandBy. I think the ActiveStandbyElector will handle reJoin automatically if exception thrown from the transtionToActive method. We can use the fail fast config for this scenario.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Jian He.

          Attach a new patch to address above comments. It also fix the checkstyle error.

          Show
          hex108 Jun Gong added a comment - Thanks Jian He . Attach a new patch to address above comments. It also fix the checkstyle error.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 6m 39s trunk passed
          +1 compile 0m 30s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 37s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 54s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 29s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          +1 checkstyle 0m 20s the patch passed
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed
          -1 unit 33m 9s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          47m 25s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMHA



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821849/YARN-5333.07.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 71a03eb4b485 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2d82276
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12627/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12627/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12627/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12627/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 6m 39s trunk passed +1 compile 0m 30s trunk passed +1 checkstyle 0m 23s trunk passed +1 mvnsite 0m 37s trunk passed +1 mvneclipse 0m 17s trunk passed +1 findbugs 0m 54s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 29s the patch passed +1 compile 0m 29s the patch passed +1 javac 0m 29s the patch passed +1 checkstyle 0m 20s the patch passed +1 mvnsite 0m 35s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 1s the patch passed +1 javadoc 0m 18s the patch passed -1 unit 33m 9s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 47m 25s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.TestRMHA Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821849/YARN-5333.07.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 71a03eb4b485 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 2d82276 Default Java 1.8.0_101 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/12627/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12627/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12627/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12627/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Fix test case error...

          Show
          hex108 Jun Gong added a comment - Fix test case error...
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          The approach seems looks good. Few things to consider

          1. refreshXXXWithoutCheck does not looks meaning full method name. I think common general pattern can be followed like below. Thoughts?
            public RefreshXXXResponse refreshXXX(RefreshXXXRequest request){
            // ACL and RM check can be combined together
            try{
            refreshXXXX(); // This should includes only loading configuration file and update required field.
            // Success audit log
            }catch{
            // failure audit log
            }
            }
            
            private void refreshXXXX(){
            // load configuration filie
            // refresh XXX fields. 
            }
            
          2. One of my major concern after seeing patch is skipping checkACL which used to verify user for every transition-to-active. But now it is skipped. Does it fine? cc:/Jian He
          3. Test failure is related to patch change. I think this test can be removed only since behavior is changed after this patch.
          Show
          rohithsharma Rohith Sharma K S added a comment - The approach seems looks good. Few things to consider refreshXXXWithoutCheck does not looks meaning full method name. I think common general pattern can be followed like below. Thoughts? public RefreshXXXResponse refreshXXX(RefreshXXXRequest request){ // ACL and RM check can be combined together try { refreshXXXX(); // This should includes only loading configuration file and update required field. // Success audit log } catch { // failure audit log } } private void refreshXXXX(){ // load configuration filie // refresh XXX fields. } One of my major concern after seeing patch is skipping checkACL which used to verify user for every transition-to-active. But now it is skipped. Does it fine? cc:/ Jian He Test failure is related to patch change. I think this test can be removed only since behavior is changed after this patch.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 55s trunk passed
          +1 compile 0m 39s trunk passed
          +1 checkstyle 0m 30s trunk passed
          +1 mvnsite 0m 50s trunk passed
          +1 mvneclipse 0m 20s trunk passed
          +1 findbugs 1m 10s trunk passed
          +1 javadoc 0m 25s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 0m 39s the patch passed
          +1 javac 0m 39s the patch passed
          +1 checkstyle 0m 25s the patch passed
          +1 mvnsite 0m 46s the patch passed
          +1 mvneclipse 0m 16s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 21s the patch passed
          +1 javadoc 0m 22s the patch passed
          -1 unit 34m 28s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          52m 54s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821971/YARN-5333.08.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux bb0a82d2a536 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / a1f6564
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12642/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12642/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12642/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12642/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 12s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 8m 55s trunk passed +1 compile 0m 39s trunk passed +1 checkstyle 0m 30s trunk passed +1 mvnsite 0m 50s trunk passed +1 mvneclipse 0m 20s trunk passed +1 findbugs 1m 10s trunk passed +1 javadoc 0m 25s trunk passed +1 mvninstall 0m 37s the patch passed +1 compile 0m 39s the patch passed +1 javac 0m 39s the patch passed +1 checkstyle 0m 25s the patch passed +1 mvnsite 0m 46s the patch passed +1 mvneclipse 0m 16s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 21s the patch passed +1 javadoc 0m 22s the patch passed -1 unit 34m 28s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 52m 54s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821971/YARN-5333.08.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux bb0a82d2a536 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / a1f6564 Default Java 1.8.0_101 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/12642/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12642/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12642/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12642/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Thanks Rohith Sharma K S for the review.

          refreshXXXWithoutCheck does not looks meaning full method name. I think common general pattern can be followed like below.

          refreshXXXWithouCheck means that there is no check for refreshXXX. If refreshXXX is acceptable, I'd like to change it.

          One of my major concern after seeing patch is skipping checkACL which used to verify user for every transition-to-active. But now it is skipped.

          I ignored it... It seems that we need add checkACL. How about adding it in refreshAll?

          refreshAll () {
             checkACL("XXX");
             refreshXXX();
             ...
          }
          

          Test failure is related to patch change. I think this test can be removed only since behavior is changed after this patch.

          Yes, it is related, I fixed it in the patch 07.patch. The test case seems useful for testing the case that refreshAll failed. Maybe we could keep it?

          Show
          hex108 Jun Gong added a comment - Thanks Rohith Sharma K S for the review. refreshXXXWithoutCheck does not looks meaning full method name. I think common general pattern can be followed like below. refreshXXXWithouCheck means that there is no check for refreshXXX. If refreshXXX is acceptable, I'd like to change it. One of my major concern after seeing patch is skipping checkACL which used to verify user for every transition-to-active. But now it is skipped. I ignored it... It seems that we need add checkACL. How about adding it in refreshAll ? refreshAll () { checkACL( "XXX" ); refreshXXX(); ... } Test failure is related to patch change. I think this test can be removed only since behavior is changed after this patch. Yes, it is related, I fixed it in the patch 07.patch. The test case seems useful for testing the case that refreshAll failed. Maybe we could keep it?
          Hide
          hex108 Jun Gong added a comment -

          Attach a new patch 09.patch.

          Rename refreshXXXWithoutCheck to refreshXXX.

          Add checkAcls("refreshAll") at the beginning of refreshAll(), then we could check user's ACL.

          Show
          hex108 Jun Gong added a comment - Attach a new patch 09.patch. Rename refreshXXXWithoutCheck to refreshXXX . Add checkAcls("refreshAll") at the beginning of refreshAll() , then we could check user's ACL.
          Hide
          sunilg Sunil G added a comment -

          Hi Jian He and Rohith Sharma K S Jun Gong

          I think the ActiveStandbyElector will handle reJoin automatically if exception thrown from the transtionToActive method.We can use the fail fast config for this scenario.

          I have another view here. If refreshAll fails due to a corrupted capacity-scheduler.xml file when a transitionToActive is happening (IOException will come from refreshQeues) as mentioned in YARN-3893, I think we need to fail-fast. Yes, it will be harsh. But we can avoid a scenario like both RMs in standby (and both RMs will switchover continuously) if I am not wrong. pls correct me if I understood this case wrong. This will happen if config is given not-to fail fast. pls share your thoughts.

          Show
          sunilg Sunil G added a comment - Hi Jian He and Rohith Sharma K S Jun Gong I think the ActiveStandbyElector will handle reJoin automatically if exception thrown from the transtionToActive method.We can use the fail fast config for this scenario. I have another view here. If refreshAll fails due to a corrupted capacity-scheduler.xml file when a transitionToActive is happening (IOException will come from refreshQeues ) as mentioned in YARN-3893 , I think we need to fail-fast. Yes, it will be harsh. But we can avoid a scenario like both RMs in standby (and both RMs will switchover continuously) if I am not wrong. pls correct me if I understood this case wrong. This will happen if config is given not-to fail fast. pls share your thoughts.
          Hide
          jianhe Jian He added a comment -

          I see, I think this makes sense. Otherwise, RM will continuously retry. Jun Gong, your opinion ?

          Show
          jianhe Jian He added a comment - I see, I think this makes sense. Otherwise, RM will continuously retry. Jun Gong , your opinion ?
          Hide
          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 10s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 13s trunk passed
          +1 compile 0m 38s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 44s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 8s trunk passed
          +1 javadoc 0m 22s trunk passed
          +1 mvninstall 0m 35s the patch passed
          +1 compile 0m 35s the patch passed
          +1 javac 0m 35s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 43s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 11s the patch passed
          +1 javadoc 0m 19s the patch passed
          +1 unit 34m 27s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          51m 29s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821997/YARN-5333.09.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux d110d18e4ec2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8f1c374
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12643/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12643/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 10s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 2 new or modified test files. +1 mvninstall 8m 13s trunk passed +1 compile 0m 38s trunk passed +1 checkstyle 0m 26s trunk passed +1 mvnsite 0m 44s trunk passed +1 mvneclipse 0m 19s trunk passed +1 findbugs 1m 8s trunk passed +1 javadoc 0m 22s trunk passed +1 mvninstall 0m 35s the patch passed +1 compile 0m 35s the patch passed +1 javac 0m 35s the patch passed +1 checkstyle 0m 23s the patch passed +1 mvnsite 0m 43s the patch passed +1 mvneclipse 0m 17s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 11s the patch passed +1 javadoc 0m 19s the patch passed +1 unit 34m 27s hadoop-yarn-server-resourcemanager in the patch passed. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 51m 29s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12821997/YARN-5333.09.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux d110d18e4ec2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 8f1c374 Default Java 1.8.0_101 findbugs v3.0.0 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12643/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12643/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Yes, I read comments in YARN-3893 again, I agree with it too. I'll update the patch soon.

          Show
          hex108 Jun Gong added a comment - Yes, I read comments in YARN-3893 again, I agree with it too. I'll update the patch soon.
          Hide
          hex108 Jun Gong added a comment -

          Attach a new patch 10.patch to address above problem.

          Show
          hex108 Jun Gong added a comment - Attach a new patch 10.patch to address above problem.
          Hide
          sunilg Sunil G added a comment -

          Thanks Jun Gong
          Sorry for sharing a late comment.
          I think the test case is more or less a general HA test case which can be common for all schedulers. So will it be better if we place it in TestRMHA.

          Show
          sunilg Sunil G added a comment - Thanks Jun Gong Sorry for sharing a late comment. I think the test case is more or less a general HA test case which can be common for all schedulers. So will it be better if we place it in TestRMHA .
          Hide
          hex108 Jun Gong added a comment -

          Hi Sunil G, in order to reproduce the error case, we need to create some queues, however queues' format is bounded to scheduler. Any suggestion to make it general?

          Show
          hex108 Jun Gong added a comment - Hi Sunil G , in order to reproduce the error case, we need to create some queues, however queues' format is bounded to scheduler. Any suggestion to make it general?
          Hide
          sunilg Sunil G added a comment -

          Yes, you are correct.

          As I see, TestRMRestart uses ParameterizedSchedulerTestBase, which will run for CAPACITY and FAIR scheduler.
          This means you might need to have a scheduler specific code like below identify scheduler and create queue.

              if (getSchedulerType().equals(SchedulerType.CAPACITY)) {
                // do something
              }
          

          This does not look much clean as its related RM Restart. We can wait for input from Jian He and Rohith Sharma K S too.

          Show
          sunilg Sunil G added a comment - Yes, you are correct. As I see, TestRMRestart uses ParameterizedSchedulerTestBase , which will run for CAPACITY and FAIR scheduler. This means you might need to have a scheduler specific code like below identify scheduler and create queue. if (getSchedulerType().equals(SchedulerType.CAPACITY)) { // do something } This does not look much clean as its related RM Restart. We can wait for input from Jian He and Rohith Sharma K S too.
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 11s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 42s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 18s trunk passed
          +1 findbugs 1m 3s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 32s the patch passed
          +1 javac 0m 32s the patch passed
          +1 checkstyle 0m 19s the patch passed
          +1 mvnsite 0m 38s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 4s the patch passed
          +1 javadoc 0m 20s the patch passed
          -1 unit 36m 36s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 37s The patch does not generate ASF License warnings.
          52m 47s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
            hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822018/YARN-5333.10.patch
          JIRA Issue YARN-5333
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 6ddd221c9d44 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 08e3338
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12645/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12645/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12645/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12645/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 11s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 7m 42s trunk passed +1 compile 0m 33s trunk passed +1 checkstyle 0m 24s trunk passed +1 mvnsite 0m 38s trunk passed +1 mvneclipse 0m 18s trunk passed +1 findbugs 1m 3s trunk passed +1 javadoc 0m 21s trunk passed +1 mvninstall 0m 32s the patch passed +1 compile 0m 32s the patch passed +1 javac 0m 32s the patch passed +1 checkstyle 0m 19s the patch passed +1 mvnsite 0m 38s the patch passed +1 mvneclipse 0m 15s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 4s the patch passed +1 javadoc 0m 20s the patch passed -1 unit 36m 36s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 37s The patch does not generate ASF License warnings. 52m 47s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore   hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12822018/YARN-5333.10.patch JIRA Issue YARN-5333 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 6ddd221c9d44 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 08e3338 Default Java 1.8.0_101 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-YARN-Build/12645/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12645/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12645/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/12645/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
          Hide
          hex108 Jun Gong added a comment -

          Test case errors are not related, addressed in YARN-5157 and YARN-5057.

          Show
          hex108 Jun Gong added a comment - Test case errors are not related, addressed in YARN-5157 and YARN-5057 .
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          Overall patch look clean now and good!!
          One thing is I personally feel in-favor of writing test in common to HA. If folks feels it is fine, I am fine to commit it. cc :-/ Jian He

          Show
          rohithsharma Rohith Sharma K S added a comment - Overall patch look clean now and good!! One thing is I personally feel in-favor of writing test in common to HA. If folks feels it is fine, I am fine to commit it. cc :-/ Jian He
          Hide
          jianhe Jian He added a comment -

          Im fine with that, thx

          Show
          jianhe Jian He added a comment - Im fine with that, thx
          Hide
          sunilg Sunil G added a comment -

          In that case, we could keep existing test case itself. +1 from my side.

          Show
          sunilg Sunil G added a comment - In that case, we could keep existing test case itself. +1 from my side.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          Thanks Sunil and Jian, I will commit it shortly.

          Show
          rohithsharma Rohith Sharma K S added a comment - Thanks Sunil and Jian, I will commit it shortly.
          Hide
          rohithsharma Rohith Sharma K S added a comment -

          committed to trunk/branch-2! thanks Jun Gong for the patch and thanks Sunil G and Jian He for the reviews!

          Show
          rohithsharma Rohith Sharma K S added a comment - committed to trunk/branch-2! thanks Jun Gong for the patch and thanks Sunil G and Jian He for the reviews!
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10223 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10223/)
          YARN-5333. Some recovered apps are put into default queue when RM HA. (rohithsharmaks: rev d9a354c2f39274b2810144d1ae133201e44e3bfc)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-trunk-Commit #10223 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10223/ ) YARN-5333 . Some recovered apps are put into default queue when RM HA. (rohithsharmaks: rev d9a354c2f39274b2810144d1ae133201e44e3bfc) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          Hide
          hex108 Jun Gong added a comment -
          Show
          hex108 Jun Gong added a comment - Thanks Rohith Sharma K S , Jian He and Sunil G
          Hide
          ebadger Eric Badger added a comment -

          Sunil G, backporting this to 2.8 broke a unit test. It would also be nice if you could comment on the JIRA when you backport so that it's obvious that the backport was at a different time than the original commit.

          testTransitionedToActiveRefreshFail(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 2.396 sec  <<< FAILURE!
          java.lang.AssertionError: null
          	at org.junit.Assert.fail(Assert.java:86)
          	at org.junit.Assert.assertTrue(Assert.java:41)
          	at org.junit.Assert.assertTrue(Assert.java:52)
          	at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testTransitionedToActiveRefreshFail(TestRMHA.java:623)
          
          Show
          ebadger Eric Badger added a comment - Sunil G , backporting this to 2.8 broke a unit test. It would also be nice if you could comment on the JIRA when you backport so that it's obvious that the backport was at a different time than the original commit. testTransitionedToActiveRefreshFail(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 2.396 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testTransitionedToActiveRefreshFail(TestRMHA.java:623)
          Hide
          sunilg Sunil G added a comment -

          Yes Eric Badger. I will make sure necessary comments are also added in main ticket as well. For now, YARN-4927 will be backported to fix the test error in branch-2.8 which originally fixed test error in trunk as well.

          Show
          sunilg Sunil G added a comment - Yes Eric Badger . I will make sure necessary comments are also added in main ticket as well. For now, YARN-4927 will be backported to fix the test error in branch-2.8 which originally fixed test error in trunk as well.

            People

            • Assignee:
              hex108 Jun Gong
              Reporter:
              hex108 Jun Gong
            • Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development