Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: graceful
    • Labels: None
    • Target Version/s:

      Description

      New CLI (or an option to the existing CLI) should put each node on the decommission list into DECOMMISSIONING state and track a timeout, terminating the nodes that have not finished in time.

      Attachments

      1. YARN-3225.patch
        21 kB
        Devaraj K
      2. YARN-3225-1.patch
        79 kB
        Devaraj K
      3. YARN-3225-2.patch
        64 kB
        Devaraj K
      4. YARN-3225-3.patch
        64 kB
        Devaraj K
      5. YARN-3225-4.patch
        65 kB
        Devaraj K
      6. YARN-3225-5.patch
        65 kB
        Devaraj K
      7. YARN-914.patch
        19 kB
        Devaraj K

        Issue Links

          Activity

          sunilg Sunil G added a comment -

          Hi Junping Du
          To understand the idea correctly, do you mean a command is to be added so that a given node can be decommissioned, and a timeout can be given to gracefully verify that this is done?
          So something like ./yarn -node <nodeID> -timeout 200 -decommission

          Please help to clarify; I would like to pursue this if you are not assigning it to yourself.

          sunilg Sunil G added a comment -

          Sorry, I slightly misunderstood earlier.
          You meant the rmadmin command with a new option such as -g. One doubt here: can a timeout also be passed?

          djp Junping Du added a comment -

          Thanks Sunil G for the comments! Yes, I mean the rmadmin command line. I think it would be better to pass a timeout by adding a parameter such as "-t". Without this parameter, it will decommission nodes forcefully, just like the old behavior. Thoughts?

          devaraj.k Devaraj K added a comment -

          What would the timeout units be here? Are we thinking of any constrained range for the timeout value? Thanks.

          djp Junping Du added a comment -

          Hi Devaraj K, thanks for taking this on. I think seconds should be fine here, as the timeout usually takes minutes.

          sunilg Sunil G added a comment -

          Another point: suppose we fire the same command again immediately with a different timeout value, and the first timeout is still ongoing. Do we need to update the timeout?

          devaraj.k Devaraj K added a comment -

          I see the same mentioned in the design doc https://issues.apache.org/jira/secure/attachment/12699496/GracefullyDecommissionofNodeManagerv3.pdf

          Before NMs get decommissioned, the timeout can be updated to shorter or
          longer. e.g. admin can terminate the CLI and resubmit it with a different timeout
          value.

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for the clarification.

          sunilg Sunil G added a comment -

          Yes Devaraj K. Thank you for the clarification.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12703395/YARN-914.patch
          against trunk revision 5578e22.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.api.TestPBImplRecords
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestNodesPage
          org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6893//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6893//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6893//console

          This message is automatically generated.

          djp Junping Du added a comment -

          Thanks Devaraj K for delivering the patch, the first one in the graceful decommission effort!
          A couple of comments:
          In RefreshNodesRequestPBImpl.java,

             @Override
          +  public long getTimeout() {
          +    return getProto().getTimeout();
          +  }
          +
          +  @Override
          +  public void setTimeout(long timeout) {
          +    builder.setTimeout(timeout);
          +  }
          

          setTimeout() has a problem: we don't set viaProto to false, so a subsequent getTimeout() will return the stale value from the old proto. I suggest adding a maybeInitBuilder() method just like the other PBImpls, and also adding a unit test to verify the PBImpl works as expected.
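
          As an illustration only (not the author's patch), the usual PBImpl pattern being suggested looks roughly like this, assuming the generated RefreshNodesRequestProto builder API:

            // Illustrative sketch of the common Hadoop PBImpl pattern, not the actual patch code.
            private void maybeInitBuilder() {
              if (viaProto || builder == null) {
                // Rebuild the builder from the current proto so writes start from the
                // latest state instead of a stale or empty builder.
                builder = RefreshNodesRequestProto.newBuilder(proto);
              }
              viaProto = false;   // subsequent reads must now go through the builder
            }

            @Override
            public void setTimeout(long timeout) {
              maybeInitBuilder();           // without this, getTimeout() would keep returning the old proto value
              builder.setTimeout(timeout);
            }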

          In NodeState.java,

          DECOMMISSION_IN_PROGRESS
          

          Jason Lowe suggested in the umbrella JIRA that DECOMMISSIONING would be a better name. I had the same feeling, so I reflected that name in the latest proposal. Do you think we should incorporate that comment here?

          In RMAdminCLI.java,

          +          .put("-refreshNodes", new UsageInfo("[-g [timeout in ms]]",
                         "Refresh the hosts information at the ResourceManager."))
          

          I think we should add more info to the description message ("Refresh the hosts information at the ResourceManager.") to explain what the -g option does. Also, per my suggestion above, it is better to specify the timeout in seconds; milliseconds are more precise but leave more room for manual operator error.
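
          For illustration, an expanded usage entry along those lines might read something like the following (the wording here is an assumption, not the text of the committed patch):

            +          .put("-refreshNodes", new UsageInfo("[-g [timeout in seconds]]",
            +              "Refresh the hosts information at the ResourceManager. With -g, "
            +              + "gracefully decommission the excluded nodes, waiting up to the given "
            +              + "timeout (in seconds) before decommissioning them forcefully."))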

          Also, it is better to rename the patch to be consistent with this JIRA's number (YARN-3225).

          devaraj.k Devaraj K added a comment -

          Updated the patch with the test corrections.

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for your quick review and comments.

          I did not see your comments before updating the patch. I will address them and upload another patch.

          djp Junping Du added a comment -

          Thanks Devaraj K. Sorry for missing an important comment on the current patch:
          From the many discussions in the umbrella JIRA (YARN-914), most of us prefer not to pass the timeout value to the YARN RM side but to handle it on the RMAdmin side, which makes things much simpler (please check the discussion there for more details). So the CLI with the -g option could be a blocking CLI until all nodes get decommissioned or the timeout expires.
          Pseudo logic in this CLI should be something like:

          refreshNodeGracefully(); - mark nodes as decommissioning
          while (!timeout && some nodes still in decommissioning) {
            checkStatusForDecommissioningNodes();
          }
          if (timeout) {
            refreshNode forcefully for the remaining nodes
          }
          

          Thoughts?
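
          A rough, self-contained Java sketch of that blocking loop is below; the AdminProtocol interface and its method names (refreshNodesGracefully, hasDecommissioningNodes, refreshNodesForcefully) are placeholders for whatever admin-protocol calls the final patch uses, not the real RMAdmin API:

            import java.util.concurrent.TimeUnit;

            /** Illustrative client-side sketch; AdminProtocol is a placeholder, not the real API. */
            class GracefulDecommissionSketch {
              interface AdminProtocol {
                void refreshNodesGracefully() throws Exception;     // mark excluded nodes DECOMMISSIONING
                boolean hasDecommissioningNodes() throws Exception; // any nodes still draining?
                void refreshNodesForcefully() throws Exception;     // force-decommission the rest
              }

              static void decommissionGracefully(AdminProtocol admin, long timeoutSeconds,
                  long pollIntervalMs) throws Exception {
                long deadline = System.currentTimeMillis()
                    + TimeUnit.SECONDS.toMillis(timeoutSeconds);
                admin.refreshNodesGracefully();
                while (System.currentTimeMillis() < deadline && admin.hasDecommissioningNodes()) {
                  Thread.sleep(pollIntervalMs);   // poll until nodes drain or the timeout expires
                }
                if (admin.hasDecommissioningNodes()) {
                  admin.refreshNodesForcefully(); // timeout hit: force the remaining nodes out
                }
              }
            }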

          devaraj.k Devaraj K added a comment -

          not to pass the timeout value to the YARN RM side but to handle it on the RMAdmin side, which makes things much simpler

          Here, what would happen to the decommissioning node if the RMAdmin CLI issued refreshNodeGracefully() and got terminated (exited) before issuing the forceful refreshNodes? This could happen with Ctrl+C on the command prompt. The node would then be in the decommissioning state forever and become unusable for new container allocations.

          djp Junping Du added a comment -

          Here, what would happen to the decommissioning node if the RMAdmin CLI issued refreshNodeGracefully() and got terminated (exited) before issuing the forceful refreshNodes? This could happen with Ctrl+C on the command prompt. The node would then be in the decommissioning state forever and become unusable for new container allocations.

          From the v3 version of the proposal in the umbrella JIRA: "If the CLI gets interrupted, then it won't keep track of the timeout to forcefully decommission the remaining nodes. However, nodes in “DECOMMISSIONING” will still get terminated later (after the running apps finish) unless the admin calls the recommission CLI on these nodes explicitly." A node in the decommissioning state is terminated by two trigger events: one is the timeout, and the other is all applications on that node finishing (which will be covered in YARN-3212). We can document what that means if a user hits Ctrl+C on the graceful decommissioning CLI. If an application (like an LRS) never ends in this situation, then the user needs to refresh these nodes forcefully (or gracefully with a timeout, without interrupting). Make sense?

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for the explanation.

          If there are some long-running containers on the NM and the RMAdmin CLI gets terminated before issuing the forceful decommission, then the NM could stay in the “DECOMMISSIONING” state irrespective of the timeout. Am I missing anything?

          AMs that have running containers on this NM will get notified via the preemption framework (with timeout), especially if the AM itself runs on this node.

          If we don't pass the timeout to the RM, then how are we going to achieve this? Or do you mean this will be handled later, once the basic things are done?

          the timeout can be updated to shorter or longer.

          With the RMAdmin CLI handling the timeout, if we want to make the timeout shorter we can issue the command in a new/same CLI with the shorter timeout, and that would be fine. For making the timeout longer, if we use a new CLI then there is a chance of the forceful decommission happening at the old CLI's timeout. Is there any constraint that this needs to be done from the same CLI?

          djp Junping Du added a comment -

          Nice discussion, Devaraj K!

          If there are some long-running containers on the NM and the RMAdmin CLI gets terminated before issuing the forceful decommission, then the NM could stay in the “DECOMMISSIONING” state irrespective of the timeout. Am I missing anything?

          If users terminate the blocking/pending CLI, it only means they want to track the timeout themselves, or they want to move the timeout earlier or later. In this case, the decommissioning nodes either get decommissioned when their apps finish (a clean quit), or wait for the user to decommission them again later. We can add some alert messages later if some nodes stay in the decommissioning stage for a really long time. The basic idea is that we agree not to track the timeout on the RM side for each individual node.

          If we don't pass the timeout to the RM, then how are we going to achieve this? Or do you mean this will be handled later, once the basic things are done?

          You are right that the timeout value could be useful to pass down to the AM for preempting containers (however, it has no effect on terminating nodes). Let's keep it here and we can leverage it later when we notify the AM.

          For making the timeout longer, if we use a new CLI then there is a chance of the forceful decommission happening at the old CLI's timeout. Is there any constraint that this needs to be done from the same CLI?

          I don't quite understand the case described here. Users should terminate the current CLI and launch a new CLI with the adjusted timeout value if they want to wait a shorter or longer time. If the previous timeout has already passed, the current CLI should have quit already with all nodes decommissioned. Am I missing something here?

          devaraj.k Devaraj K added a comment -

          I was describing in my previous comment having two RMAdmin CLIs to increase the timeout value: one CLI runs with a timeout (say 'x') and keeps waiting for it, and during this time another CLI issues the command with a higher timeout (>x). If we keep the first CLI (timeout x) running, it will issue the forceful decommission at timeout x, and the new CLI's timeout (>x) will not take effect. If we have a constraint that the graceful decommission command should be issued from only one RMAdmin CLI, then this will not be a problem.

          djp Junping Du added a comment -

          If we have a constraint that the graceful decommission command should be issued from only one RMAdmin CLI, then this will not be a problem.

          Can we make this assumption in our first phase (targeting 2.8)? IMO, decommissioning nodes is a very restrictive operation, and we don't expect multiple decommissions to happen at the same time on a cluster. We can improve this later if we think it is not good enough.

          djp Junping Du added a comment -

          One additional comment:
          For RMNodeEventType, would DECOMMISSION_WITH_DELAY sound better?

          RMNodeEventType.java
          @@ -24,6 +24,7 @@
             // Source: AdminService
             DECOMMISSION,
          +  DECOMMISSION_WITH_TIMEOUT,
          
          devaraj.k Devaraj K added a comment -

          Either way would be OK; both convey the relevant meaning. I would prefer to keep the name in sync with the client-side terminology, since we specify the client-side value as a timeout.

          If we change the event to DECOMMISSION_WITH_DELAY, do you agree to change the client side name from timeout to delay?

          djp Junping Du added a comment -

          I would prefer to keep the name in sync with the client-side terminology, since we specify the client-side value as a timeout.

          +1.

          If we change the event to DECOMMISSION_WITH_DELAY, do you agree to change the client side name from timeout to delay?

          Hmmm. How about we change both to DECOMMISSION_DELAY_WITH_TIMEOUT?

          devaraj.k Devaraj K added a comment -

          I am thinking that if we use the term 'delay', users may assume that decommissioning would happen only after the specified amount of time, whereas here we are trying to achieve it as early as possible and within the maximum time specified. I feel 'timeout' would be enough; anyway, we can wait for others to comment or suggest here.

          djp Junping Du added a comment -

          I feel 'timeout' would be enough; anyway, we can wait for others to comment or suggest here.

          I am fine with keeping timeout here for now. We can discuss the naming issue in other JIRAs later; it shouldn't block the major feature here.

          devaraj.k Devaraj K added a comment -

          OK, sure. I will update the patch soon. Thanks.

          devaraj.k Devaraj K added a comment -

          I have updated the patch as per the above discussion. Please review it.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12706513/YARN-3225-1.patch
          against trunk revision 0b9f12c.

          -1 patch. Trunk compilation may be broken.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7072//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12706513/YARN-3225-1.patch
          against trunk revision 7e6f384.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRM

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7077//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7077//console

          This message is automatically generated.

          devaraj.k Devaraj K added a comment -

          org.apache.hadoop.yarn.server.resourcemanager.TestRM

          This test failure is not related to the patch.

          djp Junping Du added a comment -

          Thanks Devaraj K for updating the patch!
          Sorry for coming in late with some major comments:
          In ResourceManagerAdministrationProtocol.java,

          +
          +  public RefreshNodesGracefullyResponse refreshNodesGracefully(
          +      RefreshNodesGracefullyRequest refreshNodesGracefullyRequest)
          +      throws YarnException, IOException;
          +
          +  @Public
          +  @Evolving
          +  @Idempotent
          +  public RefreshNodesForcefullyResponse refreshNodesForcefully(
          +      RefreshNodesForcefullyRequest refreshNodesForcefullyRequest)
          +      throws YarnException, IOException;
          

          I think we don't have to add new APIs here; we can reuse the existing refreshNodes() and add an additional optional field (like a boolean value) to RefreshNodesRequest to differentiate decommissioning immediately or with a delay (gracefully). There should be no difference between decommissioning forcefully and the previous decommission, as there should be no side effect from decommissioning an already decommissioned node (the API is marked @Idempotent). That could keep the API much simpler.
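
          In other words, callers could stay on a single API, roughly as sketched below (illustrative only; the later patches do introduce a DecommissionType enum on RefreshNodesRequest, but treat the exact calls here as an assumption):

            // One refreshNodes() API, with the request carrying the decommission mode.
            RefreshNodesRequest graceful = RefreshNodesRequest.newInstance(DecommissionType.GRACEFUL);
            RefreshNodesRequest forceful = RefreshNodesRequest.newInstance(DecommissionType.FORCEFUL);

            adminProtocol.refreshNodes(graceful);   // start graceful decommissioning
            // ... later, once the client-side timeout expires ...
            adminProtocol.refreshNodes(forceful);   // force out the remaining DECOMMISSIONING nodes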

          +  public CheckForDecommissioningNodesResponse checkForDecommissioningNodes(
          +      CheckForDecommissioningNodesRequest checkForDecommissioningNodesRequest)
          +      throws YarnException, IOException;
          

          Maybe it is better to add getDecommissioningNodes() to return a list of decommissioning nodes instead of returning a boolean value here? We can print out the decommissioning nodes that haven't finished (or a subset of them if the list is large) when hitting the timeout at the end. That could be helpful for the admin to understand what is going on.
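
          For example, when the timeout fires the CLI could report the stragglers, roughly as sketched here (the record names follow the ones discussed in this thread, but the getter on the response is an assumption):

            CheckForDecommissioningNodesResponse resp = adminProtocol.checkForDecommissioningNodes(
                CheckForDecommissioningNodesRequest.newInstance());
            if (!resp.getDecommissioningNodes().isEmpty()) {
              System.err.println("Nodes still decommissioning at timeout: "
                  + resp.getDecommissioningNodes());
            }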

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for your comments and review.

          There should be no difference between decommissioning forcefully and the previous decommission, as there should be no side effect from decommissioning an already decommissioned node (the API is marked @Idempotent).

          If we make refreshNodesForcefully the same as the refreshNodes() API, there is a chance of the node state becoming inconsistent if the hosts file is updated with new nodes, or existing nodes are removed, after refreshNodesGracefully and before refreshNodesForcefully. I feel we need the difference that refreshNodes()/refreshNodesGracefully() should read the hosts file, while refreshNodesForcefully() should not read the hosts file and should only move the node state from DECOMMISSIONING to DECOMMISSIONED.

          Maybe it is better to add getDecommissioningNodes() to return a list of decommissioning nodes instead of returning a boolean value here? We can print out the decommissioning nodes that haven't finished (or a subset of them if the list is large) when hitting the timeout at the end.

          It would be a good idea to show the decommissioning nodes; I will include this in the new patch.

          djp Junping Du added a comment -

          Thanks for the quick response, Devaraj K!

          If we make refreshNodesForcefully the same as the refreshNodes() API, there is a chance of the node state becoming inconsistent if the hosts file is updated with new nodes, or existing nodes are removed, after refreshNodesGracefully and before refreshNodesForcefully.

          +1. That's a good point.

          I feel we need the difference that refreshNodes()/refreshNodesGracefully() should read the hosts file, while refreshNodesForcefully() should not read the hosts file and should only move the node state from DECOMMISSIONING to DECOMMISSIONED.

          Agreed, that sounds better. However, as per my suggestion above, we can still keep the same admin API. On the AdminService side, we can handle the request (not a boolean value now, but a new enum) in the three cases separately.

          devaraj.k Devaraj K added a comment -

          Updated the patch as per the above discussion, please review it.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708492/YARN-3225-2.patch
          against trunk revision 79f7f2a.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7176//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7176//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7176//console

          This message is automatically generated.

          devaraj.k Devaraj K added a comment -

          Attaching the patch with the fix for the Findbugs warning.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708647/YARN-3225-3.patch
          against trunk revision c69ba81.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7188//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7188//console

          This message is automatically generated.

          devaraj.k Devaraj K added a comment -

          This failed test is not related to the patch.

          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
          
          djp Junping Du added a comment -

          Thanks Devaraj K for the patch! The latest patch looks good overall.
          Some minor comments:

          +  private long validateTimeout(String strTimeout) {
          +    long timeout;
          +    try {
          +      timeout = Long.parseLong(strTimeout);
          +    } catch (NumberFormatException ex) {
          +      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + strTimeout);
          +    }
          +    if (timeout < 0) {
          +      throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + timeout);
          +    }
          +    return timeout;
          +  }
          

          I think we should support the case where the admin wants nodes to get decommissioned whenever all apps on those nodes have finished. If so, shall we support a negative value (any negative, or a special one like -1) to specify this case?
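
          If that case is adopted, the validation above might be relaxed along these lines (illustrative sketch; treating -1 as "wait until all apps finish" is the convention proposed in this thread, not necessarily the committed behavior):

            private long validateTimeout(String strTimeout) {
              long timeout;
              try {
                timeout = Long.parseLong(strTimeout);
              } catch (NumberFormatException ex) {
                throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + strTimeout);
              }
              // -1 means "wait indefinitely until all apps on the nodes finish";
              // any other negative value is still rejected.
              if (timeout < -1) {
                throw new IllegalArgumentException(INVALID_TIMEOUT_ERR_MSG + timeout);
              }
              return timeout;
            }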

          javadoc in DecommissionType.java

          +  /** Decomissioning nodes **/
          +  NORMAL,
          +
          +  /** Graceful decommissioning of nodes **/
          +  GRACEFUL,
          +
          +  /** Forceful decommissioning of nodes **/
          +  FORCEFUL
          

          For NORMAL, shall we use "Decommission nodes in normal (old) way" instead, or something simpler: "Decommission nodes"?

          +@Private
          +@Unstable
          +public abstract class CheckForDecommissioningNodesRequest {
          +  @Public
          +  @Unstable
          +  public static CheckForDecommissioningNodesRequest newInstance() {
          +    CheckForDecommissioningNodesRequest request = Records
          +        .newRecord(CheckForDecommissioningNodesRequest.class);
          +    return request;
          +  }
          +}
          

          IMO, the methods inside a class shouldn't be more public than the class itself. If we don't expect other projects to use the class, we also don't expect its methods to be used. The same problem exists in the older RefreshNodesRequest.java API. I think we may need to fix both?
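
          That is, the factory method's annotations would be no more public than the class itself, along these lines (illustrative only; whether the final patch keeps @Private on the class is not shown here):

            @Private
            @Unstable
            public abstract class CheckForDecommissioningNodesRequest {
              @Private    // matches the class visibility instead of @Public
              @Unstable
              public static CheckForDecommissioningNodesRequest newInstance() {
                return Records.newRecord(CheckForDecommissioningNodesRequest.class);
              }
            }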

             @Test
             public void testRefreshNodes() throws Exception {
               resourceManager.getClientRMService();
          -    RefreshNodesRequest request = recordFactory
          -            .newRecordInstance(RefreshNodesRequest.class);
          +    RefreshNodesRequest request = RefreshNodesRequest
          +        .newInstance(DecommissionType.NORMAL);
               RefreshNodesResponse response = client.refreshNodes(request);
               assertNotNull(response);
             }
          

          Why do we need this change? recordFactory.newRecordInstance(RefreshNodesRequest.class) will return something with DecommissionType.NORMAL as the default, no?

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for your review.

          I think we should support the case where the admin wants nodes to get decommissioned whenever all apps on those nodes have finished. If so, shall we support a negative value (any negative, or a special one like -1) to specify this case?

          If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully (without the forceful step). Do we really need to provide special handling for this case?

          For NORMAL, shall we use "Decommission nodes in normal (old) way" instead, or something simpler: "Decommission nodes"?

          I feel "Decommission nodes in normal way" would be ok, no need to mention the 'old' term. What is your opinion on this?

          IMO, the methods inside a class shouldn't be more public than the class itself. If we don't expect other projects to use the class, we also don't expect its methods to be used. The same problem exists in the older RefreshNodesRequest.java API. I think we may need to fix both?

          I agree, I will fix both of them.

          Why do we need this change? recordFactory.newRecordInstance(RefreshNodesRequest.class) will return something with DecommissionType.NORMAL as the default, no?

          It will not make any difference because NORMAL is the default. I made this change to keep it consistent with the other decommission types.

          djp Junping Du added a comment -

          Thanks Devaraj K for replying.

          If the user wants to achieve this, they can give some larger timeout value and wait for all nodes to get decommissioned gracefully (without the forceful step). Do we really need to provide special handling for this case?

          It would be great if we can support this case, because then users don't have to come up with a large number for an important job when they don't know when it will end. Given this is a trivial effort compared to what you have already achieved, we'd better do it here instead of filing a separate JIRA. What do you think?

          I feel "Decommission nodes in normal way" would be ok, no need to mention the 'old' term. What is your opinion on this?

          Yes, that sounds good. My previous point was just not to mention "decommissioning" for the normal/previous decommission process, to avoid any confusion.

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for your immediate response.

          I have updated the patch to handle the case that waits for all the nodes to be gracefully decommissioned when the timeout value is -1, along with fixes for the other review comments.
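          As an illustration of the -1 convention described above, the sketch below shows one way an admin CLI could translate a user-supplied timeout into a decommission type and a deadline. The class, method, and constant names here are hypothetical and are not taken from the YARN-3225 patch; only the "-1 means wait for all applications to finish" semantics comes from this discussion, and the GRACEFUL constant name is assumed.

              import org.apache.hadoop.yarn.api.records.DecommissionType;

              // Hypothetical helper (not part of the patch): maps the timeout passed to a
              // graceful refreshNodes call onto a decommission type and an effective deadline.
              public final class GracefulDecommissionArgs {

                /** Sentinel meaning "wait until all applications on the node have finished". */
                public static final int WAIT_FOREVER = -1;

                private GracefulDecommissionArgs() {
                }

                /** NORMAL when no timeout is given; GRACEFUL (assumed constant name) otherwise. */
                public static DecommissionType toDecommissionType(Integer timeoutSeconds) {
                  return timeoutSeconds == null
                      ? DecommissionType.NORMAL
                      : DecommissionType.GRACEFUL;
                }

                /**
                 * Absolute time (ms) at which still-decommissioning nodes should be forcefully
                 * terminated; Long.MAX_VALUE for -1, i.e. wait for all applications to finish.
                 */
                public static long toDeadlineMillis(int timeoutSeconds, long nowMillis) {
                  if (timeoutSeconds == WAIT_FOREVER) {
                    return Long.MAX_VALUE;
                  }
                  if (timeoutSeconds < 0) {
                    throw new IllegalArgumentException(
                        "Timeout must be a non-negative number of seconds, or -1");
                  }
                  return nowMillis + timeoutSeconds * 1000L;
                }
              }

          For example, toDeadlineMillis(-1, now) yields Long.MAX_VALUE (wait indefinitely), while toDeadlineMillis(200, now) gives now + 200,000 ms.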

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12724185/YARN-3225-4.patch
          against trunk revision 6495940.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.client.api.impl.TestTimelineClient

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7268//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7268//console

          This message is automatically generated.

          djp Junping Du added a comment -

          The latest patch looks good to me. The TestTimelineClient failure looks unrelated and intermittent.
          I will kick off the Jenkins test again. +1 based on the Jenkins result.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12724185/YARN-3225-4.patch
          against trunk revision 83979e6.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.client.api.impl.TestYarnClient

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7296//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7296//console

          This message is automatically generated.

          devaraj.k Devaraj K added a comment -

          Thanks Junping Du for the review.

          The below test failure is also unrelated to the patch.

          org.apache.hadoop.yarn.client.api.impl.TestYarnClient
          
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12724185/YARN-3225-4.patch
          against trunk revision b5a0b24.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7328//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7328//console

          This message is automatically generated.

          rohithsharma Rohith Sharma K S added a comment -

          Thanks Devaraj K for the patch and Junping Du for the review.
          The overall approach and patch look good to me. One suggestion: the DecommissionType#FORCEFUL javadoc, "Forceful decommissioning of nodes", is a bit confusing. Since the FORCEFUL type is applied only to nodes that are already in the DECOMMISSIONING state, RUNNING nodes cannot be forcefully decommissioned.
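          To make the suggestion concrete, a hedged sketch of how the enum javadoc could read after incorporating this comment is shown below; the wording is illustrative only and is not the text that was actually committed.

              /**
               * The type of decommissioning applied to a node.
               * (Illustrative wording only; see the committed patch for the authoritative javadoc.)
               */
              public enum DecommissionType {

                /** Decommission the nodes immediately, without waiting for running applications. */
                NORMAL,

                /** Decommission the nodes gracefully, allowing running applications to finish first. */
                GRACEFUL,

                /**
                 * Forcefully terminate nodes that are already in the DECOMMISSIONING state,
                 * for example when the graceful timeout expires. Nodes in the RUNNING state
                 * cannot be forcefully decommissioned directly.
                 */
                FORCEFUL
              }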

          devaraj.k Devaraj K added a comment -

          Thanks Rohith Sharma K S for your review and feedback. I have incorporated your inline comment suggestion in the updated patch.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 14m 54s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 5 new or modified test files.
          -1 whitespace 0m 0s The patch has 2 line(s) that end in whitespace.
          +1 javac 7m 46s There were no new javac warning messages.
          +1 javadoc 9m 46s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 5m 28s The applied patch generated 18 additional checkstyle issues.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 4m 48s The patch does not introduce any new Findbugs (version 2.0.3) warnings.
          +1 yarn tests 0m 25s Tests passed in hadoop-yarn-api.
          +1 yarn tests 7m 17s Tests passed in hadoop-yarn-client.
          +1 yarn tests 2m 1s Tests passed in hadoop-yarn-common.
          +1 yarn tests 55m 49s Tests passed in hadoop-yarn-server-resourcemanager.
              110m 44s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12727006/YARN-3225-5.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / b08908a
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/whitespace.txt
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/checkstyle-result-diff.txt
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-client test log https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/testrun_hadoop-yarn-client.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/7446/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/7446/testReport/
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/7446//console

          This message was automatically generated.

          djp Junping Du added a comment -

          The v5 patch LGTM. The new Jenkins run only reports some trivial formatting issues, but fails to provide the details needed to fix them.
          +1. I will go ahead and commit it shortly.

          rohithsharma Rohith Sharma K S added a comment -

          +1(non-binding) LGTM.

          djp Junping Du added a comment -

          I have committed the latest (v5) patch to trunk and branch-2.
          Thanks to Devaraj K for contributing the patch, and to Sunil G and Rohith Sharma K S for the review comments!

          andrew.wang Andrew Wang added a comment -

          FYI for git log greppers: the JIRA ID is missing from the commit message.

          djp Junping Du added a comment -

          Sorry for missing this. I am feeling the same pain now while releasing 2.8.0.


            People

            • Assignee:
              devaraj.k Devaraj K
              Reporter:
              djp Junping Du
            • Votes:
              0
              Watchers:
              10
