Hadoop YARN / YARN-3893 (parent: YARN-149 [Umbrella] ResourceManager (RM) Fail-over)

Both RMs in active state when Admin#transitionToActive fails in refreshAll()

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

Cases that can cause this:

1. capacity-scheduler.xml is wrongly configured during the switch
2. Refresh ACL failure due to configuration
3. Refresh user-group failure due to configuration

Both RMs will then continuously try to become active.

      
      dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin  -getServiceState rm1
      15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      active
      dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin  -getServiceState rm2
      15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      active
      
      
1. Both web UIs show as active
2. Status is shown as active for both RMs

Attachments

1. 0001-YARN-3893.patch
        4 kB
        Bibin A Chundatt
      2. 0002-YARN-3893.patch
        4 kB
        Bibin A Chundatt
      3. 0003-YARN-3893.patch
        4 kB
        Bibin A Chundatt
      4. 0004-YARN-3893.patch
        7 kB
        Bibin A Chundatt
      5. 0005-YARN-3893.patch
        8 kB
        Bibin A Chundatt
      6. 0006-YARN-3893.patch
        7 kB
        Bibin A Chundatt
      7. 0007-YARN-3893.patch
        7 kB
        Bibin A Chundatt
      8. 0008-YARN-3893.patch
        7 kB
        Bibin A Chundatt
      9. 0009-YARN-3893.patch
        7 kB
        Bibin A Chundatt
      10. 0010-YARN-3893.patch
        8 kB
        Bibin A Chundatt
      11. yarn-site.xml
        10 kB
        Bibin A Chundatt

        Issue Links

          Activity

Xuan Gong added a comment -

          Thanks for reporting this. Could you share the YARN configurations, please ? Bibin A Chundatt

Bibin A Chundatt added a comment -

Xuan Gong, sorry for the late response.
Also, make sure the capacity-scheduler.xml configuration is one that fails, e.g. with nodeLabel.

Bibin A Chundatt added a comment -

Updated the description, since this can happen in many cases. Please do correct me if I am wrong.

Bibin A Chundatt added a comment -

Sorry, typo. Both RMs being active can happen in many cases.

Sunil G added a comment -

Thank you Bibin A Chundatt. Could you please attach the CS xml too?

Bibin A Chundatt added a comment -

Thanks Sunil G for checking the issue. In this JIRA we should decide how to handle refreshAll() failure during the transition to active. Configuration mistakes in capacity-scheduler.xml, ACLs, or user-group mapping can cause the both-RMs-active case during a switch (probably triggered together with a ZK connection error).

Once this happens, I am not sure we will be able to recover at runtime.

The capacity scheduler causing this is one such case.
YARN-3894 contains the CS xml.

Xuan Gong added a comment -

How about adding rm.transitionToStandby(true) before we throw the ServiceFailedException in the catch block?

              try {
                rm.transitionToActive();
                // call all refresh*s for active RM to get the updated configurations.
                refreshAll();
                RMAuditLogger.logSuccess(user.getShortUserName(),
                    "transitionToActive", "RMHAProtocolService");
              } catch (Exception e) {
                RMAuditLogger.logFailure(user.getShortUserName(), "transitionToActive",
                    "", "RMHAProtocolService",
                    "Exception transitioning to active");
                throw new ServiceFailedException(
                    "Error when transitioning to Active mode", e);
              }
          

In that case, we could transition the RM to standby, and since we throw the ServiceFailedException, this RM will rejoin the leader election process.
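
A minimal sketch of that suggestion, applied to the catch block quoted above (an illustration only, not the committed change):

    } catch (Exception e) {
      RMAuditLogger.logFailure(user.getShortUserName(), "transitionToActive",
          "", "RMHAProtocolService",
          "Exception transitioning to active");
      // Suggested addition: fall back to standby before surfacing the failure,
      // so this RM rejoins leader election instead of staying half-active.
      rm.transitionToStandby(true);
      throw new ServiceFailedException(
          "Error when transitioning to Active mode", e);
    }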

Sunil G added a comment -

          Hi Xuan Gong
          Thank you for the update. I have a doubt here.

If we call rm.transitionToStandby(true), it will result in a call to ResourceManager#createAndInitActiveServices().
So is it possible that we may get the same exception we got from the refreshAll call earlier, specifically during queue reinitialization? Currently CS#serviceInit will call parseQueues, and as mentioned here, Bibin A Chundatt used a wrong CS xml file.

Varun Saxena added a comment -

Maybe set the HA service state in the RM context to STANDBY upon throwing the exception, or do not set it to ACTIVE till all the active services are actually started.
We primarily check the RM context to decide whether the RM is in standby or active state.

Sunil G added a comment -

refreshAll() does a whole set of refresh operations, and an exception may come at any point. It's better to gracefully close those, so setting the state directly won't help much; we may need to go through part of transitionToStandby.

Varun Saxena added a comment -

          Sunil G
We can do the cleanup (i.e. stop active services) when we switch to standby; we do this already. Cleanup will also be done when we stop the RM, so this shouldn't be an issue.

What is happening is as follows:

Let us assume there are RM1 and RM2.
Basically, when the exception occurs, RM1 waits for RM2 to become active and joins leader election again. As both RMs have the wrong configuration, RM1 will try to become active again (and not switch to standby) after RM2 has tried the same.
Now, as the problem is in the call to refreshAll, both RMs would be marked as ACTIVE in their respective RM contexts, because we set the state to ACTIVE before calling refreshAll.

The problem reported here is that the RM is shown as Active when it is not actually ACTIVE, i.e. the UI is accessible and getServiceState returns both RMs as Active. When we access the UI or get the service state, we check the state in the RM context, and that is ACTIVE.
So for anyone accessing the RM from the command line or via the UI, the RM is active (because the RM context says so), when it is not really active. Both RMs are just trying incessantly to become active and failing.

That is why I suggested that we can update the RM context. In fact, changing the RM context is necessary. We can decide when to stop active services, if at all.

          So there are 2 options :

1. We can set the RM context to standby when the exception occurs and stop active services. But if we do this, it means we will have to redo the work of starting active services again if this RM were to become ACTIVE.
2. Introduce a new state (say WAITING_FOR_ACTIVE), set it when the exception is thrown, and check this state to stop active services when switching to standby, and to not start the services again in case of switching to ACTIVE.

          Thoughts, Sunil G, Xuan Gong ?

Varun Saxena added a comment -

For the 2nd option, we will have to return STANDBY to the client if the state is WAITING_FOR_ACTIVE. So it can primarily be an RM-internal state.

Sunil G added a comment -

Thanks Varun Saxena for sharing the detailed analysis. In fact we must change the state in the context.

IMO we can stop active services and move the RM state to Standby. With this, the RM will become another candidate for election. If the same RM is later selected as active and the config is good, then startActiveServices will be invoked via the existing call flow, so it should be fine in that case. From the UI, both RMs will be shown as Standby too.

Varun Saxena added a comment -

Yeah, let's go with the first option I suggested then, i.e. mark the RM context as standby and stop active services, followed by initialization. That will be easier to implement.
This will resolve the issue.

Bibin A Chundatt added a comment -

Thanks Varun Saxena and Sunil G, the first option looks good and is easier to implement.
Both RMs could end up in standby state, but it still looks like the best option.

Bibin A Chundatt added a comment -

Varun Saxena and Sunil G, we only need to call rm.transitionToStandby(false) on exception,
since that handles the transition to standby in the RM context and stops active services without reinitializing the queues.

Varun Saxena added a comment -

Reinitialization of active services is required. When you stop the active services, the service state for all of them changes to STOPPED.
If this RM were to become active again, we would try to start all the active services, and services can't transition to STARTED from the STOPPED state; they can only do so from the INITED state.
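
The constraint being described is the standard Hadoop service lifecycle; a generic sketch of it (the demo class is hypothetical, this is not RM code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.service.AbstractService;

    // Hypothetical no-op service, used only to illustrate the state machine.
    public class ServiceLifecycleDemo extends AbstractService {
      ServiceLifecycleDemo() { super("demo"); }

      public static void main(String[] args) {
        ServiceLifecycleDemo svc = new ServiceLifecycleDemo();
        svc.init(new Configuration());  // NOTINITED -> INITED
        svc.start();                    // INITED    -> STARTED
        svc.stop();                     // STARTED   -> STOPPED
        // Calling svc.start() again here would throw ServiceStateException:
        // a STOPPED service cannot be restarted; a fresh instance has to be
        // created and init()ed, which is why reinitialization is needed.
      }
    }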

Sunil G added a comment -

          Hi Varun Saxena

          Reinitialization of Active Services is required.

For this, I think calling rm.transitionToStandby(true) is not a good idea, because the same exception can come while initializing the CapacityScheduler (CS config file).

Xuan Gong added a comment -

How about first calling rm.transitionToStandby(false), then calling activeService.reinitiate() (we probably need to create this function)? At least the RM will transition to Standby. Even if reinitiate() throws an exception, the leader elector will handle this.

Xuan Gong added a comment -

          Sorry, should call stopAndReinitiate().

Sunil G added a comment -

          Hi Xuan Gong
Yes, we can do that. But note that we currently call rmContext.setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
as the last statement in transitionToStandby. So if an exception happens in reinitialize, the code flow won't reach the point where the state is set to Standby. So we may also need to set the state in the context to Standby.

Varun Saxena added a comment -

          Yes but we do need to reinitialize services. Otherwise transition to active when everything is fine will not happen.

Varun Saxena added a comment -

We do need to stop active services because many threads would be spawned on an attempt to transition to active.
Frankly, we can have an additional flag in the RM indicating that reinitialization of services is required, and attempt it while trying to transition to active. We can stop the services beforehand because there is no point having those threads running in standby. Thoughts?
We can do something like below:

      // Exception was thrown in the call to refreshAll().
      if (rmContext.getHAServiceState() ==
          HAServiceProtocol.HAServiceState.ACTIVE) {
        ((RMContextImpl) rmContext)
            .setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
        try {
          rm.stopActiveServices();
          // Set a flag in the RM (maybe the RM context) indicating that a reinit
          // of active services is required before the next transition to active,
          // even though the state is now standby.
        } catch (Exception ex) {
          // Best-effort cleanup while falling back to standby.
        }
      }

Sunil G added a comment -

I remember an earlier suggestion of a new HAServiceState.
Introducing a state such as WAITING_FOR_ACTIVE may help to do all the reinit or other inits when we try to move to ACTIVE. Also, as mentioned earlier, this can be a hidden internal state. It may look cleaner than a flag. So along with the above solution, could we add this new state as well?

Varun Saxena added a comment -

Xuan Gong, the issue with reinitialization is that if an exception is thrown during initialization, then all the active services will be stopped.
And when we transition to active we will directly attempt to start the active services, which would fail because the services are in state STOPPED.

I think we can forcibly set the state to standby and set a flag in RMContext indicating that a reinit is required whenever attempting a transition to standby or active. This way we let leader election handle the exception.

Varun Saxena added a comment -

          Yeah Sunil G, a new state can also be introduced as I suggested. This would act similar to a flag.

Varun Saxena added a comment -

          Xuan Gong, thoughts about introducing a new internal state ?

          We have to be very careful though when we do this as this is a very sensitive piece of code.

Xuan Gong added a comment -

Thanks Varun Saxena and Sunil G. I am fine with adding a new internal state, although it might be too complex. But if we can handle it correctly, I am fine with it.

For this specific issue, I think there are at least two things we should do here:
1) stop all active services
2) transition to standby (basically, set the RM state in RMContext to Standby)
But we also need to reinitiate all the active services to prepare for the transitionToActive call.
At least, we should do:

          rm.transitionToStandby(false);
          reinitiateActiveService();
          

          Here the reinitiateActiveService() can throw out the same exception. And I can see why this does not solve the whole problem.

How about we introduce a new AtomicBoolean flag to track whether we need to reinitiate the active services? And we could add the following to the transitionToActive logic

              if (reinitiateRequired)
                 reinitiateActiveService()
          

before we start all the active services.
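
A rough sketch of that flag-based idea (the field and helper names here are hypothetical, not taken from the actual patch):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical flag recording that active services must be re-created
    // before the next attempt to become active.
    private final AtomicBoolean reinitiateRequired = new AtomicBoolean(false);

    // In the transition-to-active path, before starting active services:
    if (reinitiateRequired.compareAndSet(true, false)) {
      // Hypothetical helper: stop, re-create and init the active services.
      reinitiateActiveService();
    }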

Varun Saxena added a comment -

Ok, let's add a flag. In my view, we need to check this flag and do the reinit even on transitionToStandby, even though the state is standby.

Sunil G added a comment -

+1 for using an AtomicBoolean flag.
Do we really need to call reinitiateActiveService from transitionToStandby? I think it can be done when we invoke transitionToActive, where it matters.

Varun Saxena added a comment -

Sunil G, I was also suggesting earlier in my comment that we reinit only while transitioning to active.

But then I thought that if we reinit on standby and there is a problem in initialization, the failure can prompt the admin to correct the config; an audit log will be printed.
If we do not reinit, a success audit log on transition to standby would be printed, which may suggest to the admin that there is no problem with the config.
Thoughts?

We can postpone reiniting till the transition to active as well, but it's better to indicate a failure even on standby, IMHO. I do not see any harm in it.
I am fine either way because reiniting really matters when transitioning to active.

Varun Saxena added a comment -

I am fine either way though because, as you said, reiniting really matters when transitioning to active.

Bibin A Chundatt added a comment -

          Sunil G, Varun Saxena and Xuan Gong Thanks a lot for comments.
          Please review

Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 8s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 javac 7m 42s There were no new javac warning messages.
          +1 javadoc 9m 39s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 46s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 20s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 25s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 51m 13s Tests failed in hadoop-yarn-server-resourcemanager.
              89m 13s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.TestRMHA



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12745382/0002-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 0a16ee6
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8539/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8539/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8539/console

          This message was automatically generated.

Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 18m 40s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 javac 8m 52s There were no new javac warning messages.
          +1 javadoc 11m 1s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 57s There were no new checkstyle issues.
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 31s mvn install still works.
          +1 eclipse:eclipse 0m 36s The patch built with eclipse:eclipse.
          +1 findbugs 1m 44s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 51m 58s Tests passed in hadoop-yarn-server-resourcemanager.
              95m 44s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12745400/0003-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / edcaae4
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/8540/artifact/patchprocess/whitespace.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8540/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8540/testReport/
          Java 1.7.0_55
          uname Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8540/console

          This message was automatically generated.

Varun Saxena added a comment -

          Thanks for the patch Bibin A Chundatt. Few comments.

          1. Nit : Should be "Exception in state transition"
                      throw new ServiceFailedException(
                          "Exception in state transistion", re);
            
          2. IMO, no need to throw ServiceFailedException when catching exception while calling reinitialize. The throw below should suffice. Just set the flag. According to me, we should retain the original exception.
          3. Add a comment indicating what the flag does.
          4. Maybe rename the flag to reinitActiveServices instead of reinitialize.
          5. The flag according to me, semantically speaking, doesn't quite belong to AdminService. Can be in ResourceManager or RMContext. Thoughts ?
          6. Can you add a test to verify the fix ?
7. I think instead of relying on transitionToStandby to change the state to standby, we can explicitly change the state in AdminService. That's because even stopActiveServices can throw an exception, and if it does, the state won't change to STANDBY. This call to stop should not throw an exception, but as services keep getting added you never know how a particular service may behave; we should be immune to it. Try something like below.
             ((RMContextImpl)rmContext).setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
            
8. Just a suggestion. If we do the above, maybe call stopActiveServices and reinitialize directly instead of calling transitionToStandby. This is because, as I said in a comment above, transitionToStandby would print an audit log saying the transition was successful, but reinitialize may subsequently fail. Not printing this audit log will also be consistent with transitionToActive failing while starting active services. Thoughts?
Bibin A Chundatt added a comment -

Attaching patch after addressing the comments and adding a testcase.

Hadoop QA added a comment -



          +1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 18m 30s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 8m 51s There were no new javac warning messages.
          +1 javadoc 10m 50s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 54s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 29s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 40s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 52m 6s Tests passed in hadoop-yarn-server-resourcemanager.
              95m 19s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12745452/0004-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / edcaae4
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8547/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8547/testReport/
          Java 1.7.0_55
          uname Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8547/console

          This message was automatically generated.

Varun Saxena added a comment -

          Bibin A Chundatt

          1. Instead of checking for exception message in test, can you check for ServiceFailedException ?
          2. Can you add a verification in the test to check whether active services were stopped ?
Bibin A Chundatt added a comment -

          Instead of checking for exception message in test, can you check for ServiceFailedException

The same is already verified in many testcases using messages.

          Can you add a verification in the test to check whether active services were stopped ?

IMO it's not required.

Rohith Sharma K S added a comment -

Sorry for coming in very late. This issue has become stale; we need to move forward!
          Regarding the patch,

1. Instead of setting a boolean flag for reinitActiveServices in AdminService and the other changes, moving createAndInitActiveServices() from transitionToStandby to just before starting activeServices would solve such issues. And on an exception while transitioning to active, handle it by adding a stopActiveServices call in ResourceManager#transitionToActive() only.
2. Probably we can also remove refreshAll() from AdminService#transitionToActive with the above approach.

          Any thoughts?

Bibin A Chundatt added a comment -

          Hi Rohith Sharma K S
Thank you for your review comments.
Will update the same and upload a patch soon.

Sunil G added a comment -

          Hi Rohith Sharma K S
          Thank you for restarting this thread.

The idea of calling createAndInitActiveServices from both ResourceManager#transitionToActive() and transitionToStandby is good. In this case, we can remove the call to refreshAll from AdminService#transitionToStandby.

Sunil G added a comment -

          Hi Rohith Sharma K S
On second thought, could we move refreshAll in AdminService#transitionToStandby/Active ahead of rm.transitionToStandby/Active?

              try {
                // call all refresh*s for active RM to get the updated configurations.
                refreshAll();
                rm.transitionToActive();
                RMAuditLogger.logSuccess(user.getShortUserName(),
                    "transitionToActive", "RMHAProtocolService");
              } catch (Exception e) {
                RMAuditLogger.logFailure(user.getShortUserName(), "transitionToActive",
                    "", "RMHAProtocolService",
                    "Exception transitioning to active");
                throw new ServiceFailedException(
                    "Error when transitioning to Active mode", e);
              }
          

Hence the exception can come before invoking the transition methods in the ResourceManager class. Thoughts?

Bibin A Chundatt added a comment -

          Hi Rohith Sharma K S
Any comments on this?

Rohith Sharma K S added a comment -

I had a closer look at both of the solutions above. The potential issues are:

1. Moving createAndInitActiveServices just before starting activeServices in transitionToActive:
  1. The switchover time will be impacted, since every transitionToActive initializes the active services.
  2. RMWebApp has a dependency on ClientRMService for starting webapps; without ClientRMService initialization, RMWebApp cannot be started.
2. Moving refreshAll before transitionToActive in AdminService is the same as triggering RMAdminCLI on the standby node. That call throws StandbyException and is retried against the active RM by RMAdminCLI. So when it comes to AdminService#transitionToActive(), refreshing before rm.transitionToActive throws a standby exception.
Rohith Sharma K S added a comment -

I think for any configuration issue while transitioning to active, AdminService should not allow the JVM to continue. If AdminService throws the exception back to the elector, the elector will again try to make the RM active, which loops forever and fills the logs.
There are two calls that can be points of failure: first rm.transitionToActive, second refreshAll().

1. If rm.transitionToActive fails, the RM services will be stopped and the RM will be in STANDBY state.
2. If refreshAll() fails, BOTH RMs will be in ACTIVE state, as per this defect. Continuing the RM services with an invalid configuration is not a good idea; moreover, invalid configurations should be surfaced to the user immediately. So it would be better to make use of the fail-fast configuration to exit the RM JVM. If this configuration is set to false, then call rm.handleTransitionToStandBy (see the sketch below).
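
Sketched against the existing AdminService#transitionToActive catch block, the proposed handling could look roughly like this (an illustration of the idea, not the exact patch; the fatal-event type name is assumed):

    } catch (Exception e) {
      if (YarnConfiguration.shouldRMFailFast(getConfig())) {
        // Invalid configuration: bring the RM JVM down via a fatal event so the
        // problem is surfaced to the user immediately.
        rmContext.getDispatcher().getEventHandler()
            .handle(new RMFatalEvent(RMFatalEventType.ACTIVE_REFRESH_FAIL, e));
      } else {
        // Otherwise fall back to standby and let the elector retry later.
        rm.handleTransitionToStandBy();
      }
      throw new ServiceFailedException(
          "Error on refreshAll during transition to Active", e);
    }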
Varun Saxena added a comment -

          Using fail fast makes sense.

Bibin A Chundatt added a comment -

          Hi Rohith Sharma K S and Sunil G

          Thanks for comments.

1. So the createAndInitActiveServices approach will not be taken.

The second approach sounds good with fail-fast.
I have updated the patch as per the suggestion. Please review.

Rohith Sharma K S added a comment -

Thanks Bibin A Chundatt for updating the patch. The patch looks mostly reasonable!
Some comments on the patch:

1. Is the {{isRMActive()}} check required? refreshAll will be executed only if transitionToActive succeeds. In any case, if you do add it, the check should be common for both, i.e. if/else.
2. In the test, is the code below expecting transitionToActive to fail? If so, then the RM state should not be Active. Why would the RM be Active if AdminService fails to transition?
            +    try {
            +      rm.adminService.transitionToActive(requestInfo);
            +    } catch (Exception e) {
            +      assertTrue("Error when transitioning to Active mode".contains(e
            +          .getMessage()));
            +    }
            +    assertEquals(HAServiceState.ACTIVE, rm.getRMContext().getHAServiceState());
            
3. Have you verified the test locally? I suspect the test may exit in the middle since you are changing the scheduler configuration: the scheduler configuration is loaded during transitionToStandby, which fails to load, and System.exit is called.
Rohith Sharma K S added a comment -

          To be clearer on the 3rd point: the handleTransitionToStandBy call will exit if transitionToStandby fails. This transition may fail because active services are initialized during the transition; CS initialization loads the new capacity-scheduler conf, which results in a wrong default queue capacity value and hence a standby transition failure.
          4. Instead of having a separate class FatalEventCountDispatcher, can it be made inline?

          varun_saxena Varun Saxena added a comment -

          A few additional comments:

          • In the exception block below, i.e. the exception block after the call to refreshAll, if YarnConfiguration.shouldRMFailFast(getConfig()) is true, we merely post a fatal event and do not return or throw an exception. This leads to the success audit log for transitionToActive being printed, which doesn't look correct, because we did encounter a problem during the transition. We should either return or throw a ServiceFailedException here as well. Both are OK, because the RM would be down later anyway, but I would prefer the exception.
             
            324	    } catch (Exception e) {
            325	      if (isRMActive() && YarnConfiguration.shouldRMFailFast(getConfig())) {
            326	        rmContext.getDispatcher().getEventHandler()
            327	            .handle(new RMFatalEvent(RMFatalEventType.ACTIVE_REFRESH_FAIL, e));
            328	      }else{
            329	        rm.handleTransitionToStandBy();
            330	        throw new ServiceFailedException(
            331	            "Error on refreshAll during transistion to Active", e);
            332	      }
            333	    }
            334	    RMAuditLogger.logSuccess(user.getShortUserName(), "transitionToActive",
            335	        "RMHAProtocolService");
            336	  }
            
          • In TestRMHA, the import below is unused.
            	import io.netty.channel.MessageSizeEstimator.Handle;
            
          • A nit: there should be a space before else.
            328	      }else{
            329	        rm.handleTransitionToStandBy();
            
          • In the added test, the assert is not required in the exception block after the first call to transitionToActive.
          • Maybe we can add an assert in the test for the service state being STANDBY after the call to transitionToActive with an incorrect capacity scheduler config and fail-fast set to false.
          varun_saxena Varun Saxena added a comment -

          Moreover, the fail-fast configuration doesn't quite work as expected here. If the capacity scheduler configuration is wrong, initialization will fail again and the JVM will exit, which is essentially the same as the other case. IMO we can handle the fail-fast-true case the same way as earlier.

          The reason it works in the test (the JVM does not exit) is that you have passed a CapacitySchedulerConfiguration object to MockRM. As CapacitySchedulerConfiguration is not an instanceof YarnConfiguration, this leads to a new YarnConfiguration object being created and passed to ResourceManager.
          When you change the configuration in the test and set the queue capacity to 200, it is not reflected in the Configuration object held by the ResourceManager class. That is why the JVM does not exit when we transition to standby.
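
          (A minimal, self-contained sketch of the copy semantics described above, using only the standard Configuration and YarnConfiguration constructors; the class name and property used here are illustrative, not the actual test code.)

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.yarn.conf.YarnConfiguration;

            public class ConfCopyDemo {
              public static void main(String[] args) {
                // Stand-in for the CapacitySchedulerConfiguration object passed to MockRM.
                Configuration testConf = new Configuration(false);
                testConf.set("yarn.scheduler.capacity.root.default.capacity", "100");

                // Because testConf is not an instanceof YarnConfiguration, a new YarnConfiguration
                // is created from it; the copy constructor captures the entries as they are now.
                YarnConfiguration rmConf = new YarnConfiguration(testConf);

                // Mutating the original object afterwards (as the test does when it sets the
                // capacity to 200) is not visible through the copy held by the ResourceManager.
                testConf.set("yarn.scheduler.capacity.root.default.capacity", "200");

                // Prints 100, not 200.
                System.out.println(rmConf.get("yarn.scheduler.capacity.root.default.capacity"));
              }
            }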

          varun_saxena Varun Saxena added a comment -

          Sorry, I meant we can handle the case of the fail-fast config being false the same way as we were doing in earlier patches. Otherwise checking for fail-fast doesn't make any difference, because both code paths lead to the same result.

          rohithsharma Rohith Sharma K S added a comment -

          There are 2 types of refresh that can happen: 1. yarn-site.xml refresh, 2. scheduler configuration refresh. Scheduler configurations are reloaded on every service initialization, which is by design. If there is an issue in the scheduler configuration, the fail-fast behavior is the same for both true and false. The fail-fast configuration is useful when the admin makes a mistake in yarn-site.xml: with a wrong configuration in yarn-site.xml the RM service can still be up, whereas with a wrong scheduler configuration the service can NOT be up at all. So, on a best-effort basis to keep the service up, the exception handling for yarn-site.xml and for the scheduler configuration is different.

          BTW, making the RM state StandBy would fill up the logs very quickly because the elector continuously tries to make it active. For any configuration issue it is better to exit the JVM and notify the admin that the RM is down, so that the admin can check the logs and identify the problem.

          varun_saxena Varun Saxena added a comment -

          Hmm... my point of view is based on the fact that the service cannot be up unless at least one RM is active. A standby RM is not going to serve anything anyway.
          Until the configurations of this RM are corrected, whether yarn-site or scheduler configurations, this RM cannot become active anyway (refreshAll will always fail). And you could say there might be some silly mistake in the scheduler configuration too.

          What we were doing before in the patch won't fill up the logs if the configuration is OK on the other RM. And if it is not OK on the other RM, the logs will fill up even if refreshAll fails because of something other than the scheduler config (and fail-fast is false).
          fail-fast is true by default, and if the admin sets it to false, he will know what to expect.

          But you could say an RM shutting down is a far more alarming thing for an admin, and that scheduler configurations are more important. I agree with that. Maybe we can bring the RM with the wrong configuration down at all times, because until the config is corrected (whether yarn-site or scheduler config) this RM cannot become active.

          Let us take the opinion of a couple of others on this as well. We can do whatever the consensus is.

          varun_saxena Varun Saxena added a comment -

          In previous patches, we were delaying reinitialization until the next attempt to transition to active, rather than attempting it immediately as we have done here. Do you expect any issues with that?

          varun_saxena Varun Saxena added a comment -

          Saw your comments above. We can't do what we were doing earlier because, as you say, the WebApp should be up even in standby. Let me think whether something else can be done.

          rohithsharma Rohith Sharma K S added a comment -

          Hi Varun Saxena, trying to understand your point: my suggestion is to exit the RM if there is any configuration issue during refreshAll in AdminService#transitionToActive. As I gave the reasons in an earlier comment for bringing the RM JVM down rather than keeping it alive, do you have any concern with exiting the RM for configuration issues?

          varun_saxena Varun Saxena added a comment -

          I do not have any concern with exiting the JVM. If fail-fast is true (the default behavior), the JVM will exit anyway.

          I was wondering whether it would be semantically appropriate to make the JVM exit in some cases when somebody has explicitly changed the fail-fast config to false. Logs can fill up if yarn-site.xml is wrong on both RMs too.

          I am not sure about the webapp part though. Does it require the client RM service to be initialized? AFAIK, if the RM is standby it will hit the webapp filter and redirect to the other RM (which may be active). I haven't tested the UI after applying the previous patches, so maybe Bibin can tell. If there are issues with the webapp, we will have to exit the JVM if the transition to standby fails, because there may be no other way out then.
          I will discuss this further with you offline.

          varun_saxena Varun Saxena added a comment -

          In fact, in my view we can crash the RM at all times if the config is wrong, because until the config is corrected the RM with the wrong config cannot become active (and hence will be unusable). In that case the fail-fast config won't even be required. So should we change the behavior to keep the RM in standby (but up) if fail-fast is set to false? Anyway, we can discuss this in more detail face to face.

          sunilg Sunil G added a comment -

          As I see it, a JVM exit is reasonable, as proposed by Rohith earlier. It is mostly the scheduler configurations that are wrong, and there is no need to switch to standby or consult fail-fast etc. If we can exit the JVM directly, it will be clean, and there will be enough information in the logs to analyze the reasons for the config failure.

          varun_saxena Varun Saxena added a comment -

          Yes, I agree. We can exit the JVM directly. No need to use fail-fast.

          bibinchundatt Bibin A Chundatt added a comment -

          So a JVM exit is the conclusion after the discussion.
          Attaching a patch based on the same.

          bibinchundatt Bibin A Chundatt added a comment -

          Missed one comment: the isRMActive check is not required. Attaching the patch again.

          varun_saxena Varun Saxena added a comment -

          The latest patch, 0008-YARN-3893.patch LGTM.
          +1 pending Jenkins.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 15s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 41s There were no new javac warning messages.
          +1 javadoc 9m 59s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 50s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 27s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 32s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 52m 9s Tests failed in hadoop-yarn-server-resourcemanager.
              90m 50s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12752428/0006-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / a4d9acc
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8914/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8914/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8914/console

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 32s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 46s There were no new javac warning messages.
          +1 javadoc 9m 48s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 51s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 29s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 29s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 53m 19s Tests failed in hadoop-yarn-server-resourcemanager.
              92m 14s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens
            hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
            hadoop.yarn.server.resourcemanager.TestClientRMService



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12752434/0007-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / a4d9acc
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8915/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8915/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8915/console

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -



          +1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 43s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 55s There were no new javac warning messages.
          +1 javadoc 10m 8s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 51s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 31s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 31s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 53m 39s Tests passed in hadoop-yarn-server-resourcemanager.
              93m 18s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12752437/0008-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / a4d9acc
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8916/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8916/testReport/
          Java 1.7.0_55
          uname Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8916/console

          This message was automatically generated.

          bibinchundatt Bibin A Chundatt added a comment -

          The test failures are not related to this patch. I have looked into the failed test cases:

          hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens - due to a bind exception
          hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService - verified locally, it works fine and passes
          hadoop.yarn.server.resourcemanager.TestClientRMService - ran locally in Eclipse, it works fine

          bibinchundatt Bibin A Chundatt added a comment -

          Above comments are for https://builds.apache.org/job/PreCommit-YARN-Build/8915/testReport/

          Naganarasimha Naganarasimha G R added a comment -

          Hi Bibin A Chundatt,
          Thanks for the patch. The test cases ran fine, and the approach and the test case seem fine, but a few comments from my side:

          1. The timeout of 900000 is on the higher side; is that much required, or was it for local testing?
          2. Instead of putting the test case in TestRMHA, can we think of adding it to TestRMAdminService, since the failure is related to the transition to Active?
          3. Maybe, while throwing the RMFatalEvent, it would be better to wrap the existing exception in another exception with a message saying that the transition to active failed, so that the RM logs have clear information on which operation it exited. Or maybe the eventType could have a more intuitive name, TRANSITION_TO_ACTIVE_FAILED, instead of ACTIVE_REFRESH_FAIL.
          bibinchundatt Bibin A Chundatt added a comment -

          Hi Naga

          Thanks for looking into the patch.

          The timeout of 900000 is on the higher side; is that much required, or was it for local testing?

          Will update the same.

          Instead of putting the test case in TestRMHA, can we think of adding it to TestRMAdminService, since the failure is related to the transition to Active?

          As I understand it, all transitionToActive and HA related test cases are added in the same class.

          3. TRANSITION_TO_ACTIVE_FAILED: it is not actually transitionToActive that is failing, it is refreshAll, right? That is the reason it was given a specific name.

          Points 2 and 3 are not mandatory fix items, right?

          bibinchundatt Bibin A Chundatt added a comment -

          Attaching a patch after handling the comments.

          1. Timeout updated in the test case
          2. Changed ACTIVE_REFRESH_FAIL to TRANSITION_TO_ACTIVE_FAILED (a sketch of the renamed event follows below)
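
          (A hypothetical illustration of how the renamed fatal event could be posted when refreshAll() fails, modeled on the patch excerpt quoted earlier in this thread; the trailing ServiceFailedException follows Varun's earlier suggestion and, like the rest of the fragment, is illustrative rather than the attached 0010 patch itself.)

            // Hypothetical fragment of AdminService#transitionToActive: if refreshAll() fails
            // after the RM has become active, post a fatal event so that the RM JVM exits
            // instead of leaving both ResourceManagers reporting the active state.
            } catch (Exception e) {
              rmContext.getDispatcher().getEventHandler().handle(
                  new RMFatalEvent(RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED, e));
              throw new ServiceFailedException(
                  "Error on refreshAll during transition to Active", e);
            }
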
          Naganarasimha Naganarasimha G R added a comment -

          Hi Bibin A Chundatt
          2> There are test cases related to the transition in TestRMAdminService.testRMHAWithFileSystemBasedConfiguration, but most of them are present in TestRMHA, so I think it should be fine.

          3> Well, IMHO it would be better handled with the latter approach I suggested: refreshAll is just a private method, while the actual operation that failed is transitionToActive, and TRANSITION_TO_ACTIVE_FAILED is more readable than ACTIVE_REFRESH_FAIL.

          Naganarasimha Naganarasimha G R added a comment -

          Oops, saw this message late!

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 26s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 42s There were no new javac warning messages.
          +1 javadoc 10m 2s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 51s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 30s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 32s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 53m 42s Tests failed in hadoop-yarn-server-resourcemanager.
              92m 44s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
            hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesHttpStaticUserPermissions



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12752740/0010-YARN-3893.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 0bf2854
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8929/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8929/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8929/console

          This message was automatically generated.

          bibinchundatt Bibin A Chundatt added a comment -

          Testcase failures are not related to this patch.

          rohithsharma Rohith Sharma K S added a comment -

          +1, LGTM. Will commit it tomorrow if there are no objections/comments from other folks.

          rohithsharma Rohith Sharma K S added a comment -

          Committed to branch-2.7.2, branch-2 and trunk. Thanks Bibin!

          varun_saxena Varun Saxena added a comment -

          +1...lgtm too

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8387 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8387/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #1070 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1070/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #342 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/342/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #335 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/335/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2264 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2264/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #325 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/325/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2284 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2284/)
          YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
          djp Junping Du added a comment -

          Hi Bibin A Chundatt, Rohith Sharma K S and Xuan Gong, is this bug also valid on branch-2.6? If so, maybe we should consider backporting it to branch-2.6?

          rohithsharma Rohith Sharma K S added a comment -

          I think it should exist in 2.6 too; let me cross-check. If it does, I will backport this to 2.6.

          rohithsharma Rohith Sharma K S added a comment -

          This issue is valid for branch-2.6. I have backported it to 2.6.4.

          djp Junping Du added a comment -

          Thanks Rohith Sharma K S !

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9060 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9060/)
          Add YARN-2975, YARN-3893, YARN-2902 and YARN-4354 to Release 2.6.4 entry (junping_du: rev b6c9d3fab9c76b03abd664858f64a4ebf3c2bb20)

          • hadoop-yarn-project/CHANGES.txt

            People

            • Assignee:
              bibinchundatt Bibin A Chundatt
              Reporter:
              bibinchundatt Bibin A Chundatt
            • Votes:
              0
              Watchers:
              15
