Hadoop YARN / YARN-214

RMContainerImpl does not handle event EXPIRE at state RUNNING

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.3, 2.0.1-alpha
    • Fix Version/s: 2.0.3-alpha, 0.23.5
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      RMContainerImpl has a race condition where a container can enter the RUNNING state just as the container expires. This results in an invalid event transition error:

      2012-11-11 05:31:38,954 [ResourceManager Event Processor] ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at RUNNING
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:205)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:44)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.containerCompleted(SchedulerApp.java:203)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1337)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:739)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:659)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340)
              at java.lang.Thread.run(Thread.java:619)
      

      EXPIRE needs to be handled (or at least ignored) in the RUNNING state to account for this race condition.
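To illustrate the failure and the requested behavior, here is a minimal sketch of a table-driven state machine. It is not Hadoop's RMContainerImpl or StateMachineFactory: the class `ContainerStateSketch`, its reduced state and event sets, and the `ignoreExpireWhenRunning` flag are all hypothetical names chosen for this example. Without an entry for EXPIRE at RUNNING the lookup fails, much like the exception in the log above; with a self-transition registered, the late event is dropped.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a container state machine (not Hadoop code). */
class ContainerStateSketch {
    enum State { ACQUIRED, RUNNING, EXPIRED }
    enum Event { LAUNCHED, EXPIRE }

    // (current state, event) -> next state; a missing entry means "invalid transition".
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.ACQUIRED;

    ContainerStateSketch(boolean ignoreExpireWhenRunning) {
        add(State.ACQUIRED, Event.LAUNCHED, State.RUNNING);
        add(State.ACQUIRED, Event.EXPIRE, State.EXPIRED);
        if (ignoreExpireWhenRunning) {
            // Self-transition with no side effects: a late EXPIRE is simply dropped.
            add(State.RUNNING, Event.EXPIRE, State.RUNNING);
        }
    }

    private void add(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    /** Applies the event, or throws if no transition is registered for it. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }
}
```

With `new ContainerStateSketch(false)`, calling `handle(Event.LAUNCHED)` and then `handle(Event.EXPIRE)` throws; with `true`, the second call leaves the container in RUNNING.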

      1. YARN-214.patch
        5 kB
        Jonathan Eagles
      2. YARN-214.patch
        5 kB
        Jonathan Eagles
      3. YARN-214.patch
        5 kB
        Jonathan Eagles
      4. YARN-214.patch
        5 kB
        Jonathan Eagles
      5. YARN-214.patch
        6 kB
        Jonathan Eagles

        Activity

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553451/YARN-214.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/147//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/147//console

        This message is automatically generated.

        Jason Lowe added a comment -

        We don't want to kill containers that receive EXPIRE in the RUNNING state.

        The EXPIRE event is a watchdog for containers that are acquired but never get up and running. When a container makes it to the RUNNING state, we explicitly cancel the timer that would trigger the EXPIRE event. In the real-world case we saw, the RM was processing events slowly, so the LAUNCHED event wasn't pulled out of the event queue and processed before the timer expired. As a result, we got the EXPIRE event in the RUNNING state.

        In this case we should just ignore the EXPIRE event in the RUNNING state since the container is now running and there's no need to kill it. Arguably, if we're backed up enough in processing events and the container exits quickly, we could get EXPIRE in the COMPLETED state as well.
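The ordering described above can be replayed deterministically. The following is a rough single-threaded sketch, not the RM's dispatcher: `ExpireRaceSketch`, `nodeReportsLaunch`, `watchdogFires`, and `drain` are hypothetical names, and the watchdog is modeled as a plain boolean rather than a real timer. It assumes the launch report is queued first, the timer then fires while that report is still unprocessed, and the queue is drained afterwards.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical single-threaded replay of the race; names are illustrative only. */
class ExpireRaceSketch {
    enum State { ACQUIRED, RUNNING, EXPIRED }
    enum Event { LAUNCHED, EXPIRE }

    private final Queue<Event> queue = new ArrayDeque<>();
    private final List<String> log = new ArrayList<>();
    private State state = State.ACQUIRED;
    private boolean timerArmed = true; // watchdog armed when the container was acquired

    void nodeReportsLaunch() {
        queue.add(Event.LAUNCHED);
    }

    /** The watchdog fires on elapsed time, regardless of what is still queued. */
    void watchdogFires() {
        if (timerArmed) {
            timerArmed = false;
            queue.add(Event.EXPIRE);
        }
    }

    /** Drains the queue in order, as a slow dispatcher eventually would. */
    List<String> drain() {
        Event e;
        while ((e = queue.poll()) != null) {
            log.add(e + " at " + state);
            if (e == Event.LAUNCHED && state == State.ACQUIRED) {
                state = State.RUNNING;
                timerArmed = false; // cancel the timer; too late if EXPIRE is already queued
            } else if (e == Event.EXPIRE && state == State.ACQUIRED) {
                state = State.EXPIRED;
            }
            // EXPIRE at RUNNING falls through: ignored, the container keeps running.
        }
        return log;
    }

    State state() {
        return state;
    }
}
```

Calling `nodeReportsLaunch()`, then `watchdogFires()`, then `drain()` records "LAUNCHED at ACQUIRED" followed by "EXPIRE at RUNNING", and the container stays in RUNNING because the late event is ignored.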

        Jonathan Eagles added a comment -

        I changed the transition to a no-op and removed the app attempt captor, since the app attempt isn't being modified now.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553734/YARN-214.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/148//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/148//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        Now that I understand the code better, I think that ignoring EXPIRE in the RUNNING state makes sense. The EXPIRE event only happens when a container has been waiting in allocated for more than 10 minutes (default config). This really would only happen when an app has gotten a container and forgotten about it, or when the RM is running very slowly and has not processed the transition events by the time the EXPIRE event is sent.

        We register for the EXPIRE event in the AcquiredTransition going to the ACQUIRED state, so we need to handle the EXPIRE event in every state that is reachable from ACQUIRED and has not already processed the EXPIRE event. This means we need to handle it in the KILLED, RUNNING, COMPLETED, and RELEASED states. We need to add this to KILLED and RELEASED too.
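The reachability argument can be checked mechanically with a breadth-first search. The sketch below is only an illustration: `ExpireCoverageSketch` and `assumedEdges` are hypothetical, and the edge list is an assumption made for this example rather than the real RMContainerImpl transition table. It walks every non-EXPIRE edge starting at ACQUIRED and collects the states where a late EXPIRE could still arrive.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical reachability check; the edges are assumed, not read from Hadoop. */
class ExpireCoverageSketch {
    enum State { ACQUIRED, RUNNING, COMPLETED, KILLED, RELEASED, EXPIRED }

    /** Assumed transitions that do NOT consume the EXPIRE event. */
    static Map<State, Set<State>> assumedEdges() {
        Map<State, Set<State>> edges = new EnumMap<State, Set<State>>(State.class);
        edges.put(State.ACQUIRED,
                EnumSet.of(State.RUNNING, State.COMPLETED, State.KILLED, State.RELEASED));
        edges.put(State.RUNNING,
                EnumSet.of(State.COMPLETED, State.KILLED, State.RELEASED));
        return edges;
    }

    /** States reachable from ACQUIRED without EXPIRE having been processed. */
    static Set<State> statesNeedingExpire(Map<State, Set<State>> edges) {
        Set<State> seen = EnumSet.noneOf(State.class);
        Deque<State> frontier = new ArrayDeque<>();
        frontier.add(State.ACQUIRED);
        while (!frontier.isEmpty()) {
            State s = frontier.poll();
            Set<State> successors = edges.get(s);
            if (successors == null) {
                continue; // terminal state in this sketch
            }
            for (State next : successors) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return seen;
    }
}
```

Under these assumed edges, `statesNeedingExpire(assumedEdges())` yields RUNNING, COMPLETED, KILLED, and RELEASED, which matches the list in the comment above. EXPIRED is excluded because it is only entered by processing EXPIRE itself.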

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12553803/YARN-214.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/149//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/149//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        +1 the changes look good. I'll check them in.

        Robert Joseph Evans added a comment -

        Thanks Jon,

        I put this in trunk, branch-2, and branch-0.23

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3032 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3032/)
        YARN-214. RMContainerImpl does not handle event EXPIRE at state RUNNING (jeagles via bobby) (Revision 1410522)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410522
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #39 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/39/)
        YARN-214. RMContainerImpl does not handle event EXPIRE at state RUNNING (jeagles via bobby) (Revision 1410522)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410522
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #438 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/438/)
        svn merge -c 1410522 FIXES: YARN-214. RMContainerImpl does not handle event EXPIRE at state RUNNING (jeagles via bobby) (Revision 1410527)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410527
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
        • /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1229 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1229/)
        YARN-214. RMContainerImpl does not handle event EXPIRE at state RUNNING (jeagles via bobby) (Revision 1410522)

        Result = SUCCESS
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410522
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1260 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1260/)
        YARN-214. RMContainerImpl does not handle event EXPIRE at state RUNNING (jeagles via bobby) (Revision 1410522)

        Result = FAILURE
        bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410522
        Files :

        • /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java
        • /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/TestRMContainerImpl.java

          People

          • Assignee: Jonathan Eagles
          • Reporter: Jason Lowe
          • Votes: 0
          • Watchers: 8
