Hadoop YARN
  1. Hadoop YARN
  2. YARN-63

RMNodeImpl is missing valid transitions from the UNHEALTHY state

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.3, 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha, 0.23.3
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      The ResourceManager isn't properly handling nodes that have been marked UNHEALTHY when they are lost or are decommissioned.

      1. YARN-63-branch-0.23.patch
        5 kB
        Jason Lowe
      2. YARN-63.patch
        6 kB
        Jason Lowe

        Activity

        Hide
        Jason Lowe added a comment -

        Sample exception stacktrace from the ResourceManager:

         [ResourceManager Event Processor]2012-08-29 07:12:04,556 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxx:xxx Timed out after 600 secs
         [Ping Checker]2012-08-29 07:12:04,556 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
         [AsyncDispatcher event handler]org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at UNHEALTHY
                at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
                at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
                at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
                at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:301)
                at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:65)
                at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:446)
                at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:430)
                at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
                at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
                at java.lang.Thread.run(Thread.java:619)
        2012-08-29 07:12:04,556 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event EXPIRE on Node  xxx:xxx
        
        Show
        Jason Lowe added a comment - Sample exception stacktrace from the ResourceManager: [ResourceManager Event Processor]2012-08-29 07:12:04,556 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxx:xxx Timed out after 600 secs [Ping Checker]2012-08-29 07:12:04,556 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state [AsyncDispatcher event handler]org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at UNHEALTHY at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:301) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:65) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:446) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:430) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) at java.lang.Thread.run(Thread.java:619) 2012-08-29 07:12:04,556 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event EXPIRE on Node xxx:xxx
        Hide
        Jason Lowe added a comment -

        Straightforward patch to add support for REBOOTING, DECOMMISSION, and EXPIRE events for UNHEALTHY nodes.

        Show
        Jason Lowe added a comment - Straightforward patch to add support for REBOOTING, DECOMMISSION, and EXPIRE events for UNHEALTHY nodes.
        Hide
        Jason Lowe added a comment -

        Patch is targeted for trunk/2.x. I see branch-0.23 is missing support for CLEANUP_APP and CLEANUP_CONTAINER which were added as part of MAPREDUCE-3533, so we should add handling of those events in the branch-0.23 patch if we're not going to pull in MAPREDUCE-3533.

        Show
        Jason Lowe added a comment - Patch is targeted for trunk/2.x. I see branch-0.23 is missing support for CLEANUP_APP and CLEANUP_CONTAINER which were added as part of MAPREDUCE-3533 , so we should add handling of those events in the branch-0.23 patch if we're not going to pull in MAPREDUCE-3533 .
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12543149/YARN-63.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/12//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/12//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12543149/YARN-63.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 1 new or modified test files. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/12//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/12//console This message is automatically generated.
        Hide
        Jason Lowe added a comment -

        Patch for branch-0.23 which includes the missed handling of CLEANUP_APP and CLEANUP_CONTAINER which was already in trunk.

        Show
        Jason Lowe added a comment - Patch for branch-0.23 which includes the missed handling of CLEANUP_APP and CLEANUP_CONTAINER which was already in trunk.
        Hide
        Robert Joseph Evans added a comment -

        The changes look good +1.

        I am a bit curious about the DECOMMISSIONED and REBOOTED states. Is it possible for a CLEANUP_APP, or CLEANUP_CONTAINER event to show up after the node has transitioned to these states? If this is an issue we can address it on a separate JIRA.

        Show
        Robert Joseph Evans added a comment - The changes look good +1. I am a bit curious about the DECOMMISSIONED and REBOOTED states. Is it possible for a CLEANUP_APP, or CLEANUP_CONTAINER event to show up after the node has transitioned to these states? If this is an issue we can address it on a separate JIRA.
        Hide
        Robert Joseph Evans added a comment -

        Thanks Jason,

        I put this into trunk, branch-2, and branch-0.23

        Show
        Robert Joseph Evans added a comment - Thanks Jason, I put this into trunk, branch-2, and branch-0.23
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2720 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2720/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #2720 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2720/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2657 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2657/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #2657 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2657/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2686 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2686/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #2686 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2686/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #361 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/361/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379501)

        Result = UNSTABLE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379501
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #361 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/361/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379501) Result = UNSTABLE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379501 Files : /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1152 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1152/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1152 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1152/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1183 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1183/)
        YARN-63. RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1183 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1183/ ) YARN-63 . RMNodeImpl is missing valid transitions from the UNHEALTHY state (Jason Lowe via bobby) (Revision 1379498) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1379498 Files : /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
        Hide
        Jason Lowe added a comment -

        I am a bit curious about the DECOMMISSIONED and REBOOTED states. Is it possible for a CLEANUP_APP, or CLEANUP_CONTAINER event to show up after the node has transitioned to these states?

        It definitely can receive CLEANUP_APP as I've seen that exact case in an RM log. That event is sent to all nodes that ran containers for an application when the application completes. CLEANUP_CONTAINER could be received if the container expires just as the node transitions to UNHEALTHY. So I believe it's appropriate to handle these events in the UNHEALTHY state.

        Show
        Jason Lowe added a comment - I am a bit curious about the DECOMMISSIONED and REBOOTED states. Is it possible for a CLEANUP_APP, or CLEANUP_CONTAINER event to show up after the node has transitioned to these states? It definitely can receive CLEANUP_APP as I've seen that exact case in an RM log. That event is sent to all nodes that ran containers for an application when the application completes. CLEANUP_CONTAINER could be received if the container expires just as the node transitions to UNHEALTHY. So I believe it's appropriate to handle these events in the UNHEALTHY state.
        Hide
        Robert Joseph Evans added a comment -

        It would also be nice to change it so that when an invalid transition happens in the node that more then just an error is logged, but I am not really sure what else it should do. So adding in those transitions are probably not that critical.

        Show
        Robert Joseph Evans added a comment - It would also be nice to change it so that when an invalid transition happens in the node that more then just an error is logged, but I am not really sure what else it should do. So adding in those transitions are probably not that critical.

          People

          • Assignee:
            Jason Lowe
            Reporter:
            Jason Lowe
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development