Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-2416

InvalidStateTransitonException in ResourceManager if AMLauncher does not receive response for startContainers() call in time

    Details

    • Hadoop Flags:
      Reviewed

      Description

      AMLauncher calls startContainers(allRequests) to launch a container for application master. Normally, the call comes back immediately so that the RMAppAttempt changes its state from ALLOCATED to LAUNCHED.

      However, we do observed that in some cases, the RPC call came back very late but the AM container was already started. Because the RMAppAttempt stuck in ALLOCATED state, once resource manager received the REGISTERED event from the application master, it threw InvalidStateTransitonException as follows.

      2014-07-05 08:59:05,021 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: REGISTERED at ALLOCATED
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      at java.lang.Thread.run(Thread.java:744)

      For subsequent STATUS_UPDATE and CONTAINER_ALLOCATED events for this job, resource manager kept throwing InvalidStateTransitonException.

      2014-07-05 08:59:06,152 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at ALLOCATED
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      at java.lang.Thread.run(Thread.java:744)
      2014-07-05 08:59:07,779 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_02_000002 Container Transitioned from NEW to
      ALLOCATED
      2014-07-05 08:59:07,779 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at ALLOCATED
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      at java.lang.Thread.run(Thread.java:744)

      1. YARN-2416.001.patch
        8 kB
        Jonathan Eagles
      2. YARN-2416.002.patch
        8 kB
        Jonathan Eagles
      3. YARN-2416.003.patch
        9 kB
        Jonathan Eagles

        Activity

        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12226 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12226/)
        YARN-2416. InvalidStateTransitonException in ResourceManager if (jlowe: rev 3efcd51c3b3eb667d83e08b500bb7a7ea559fabe)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12226 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12226/ ) YARN-2416 . InvalidStateTransitonException in ResourceManager if (jlowe: rev 3efcd51c3b3eb667d83e08b500bb7a7ea559fabe) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
        Hide
        jlowe Jason Lowe added a comment -

        Thanks, Jonathan Eagles! I committed this to trunk, branch-2, branch-2.8, and branch-2.8.2.

        Show
        jlowe Jason Lowe added a comment - Thanks, Jonathan Eagles ! I committed this to trunk, branch-2, branch-2.8, and branch-2.8.2.
        Hide
        jlowe Jason Lowe added a comment -

        +1 for the latest patch. Committing this.

        Show
        jlowe Jason Lowe added a comment - +1 for the latest patch. Committing this.
        Hide
        jeagles Jonathan Eagles added a comment - - edited

        YARN-7044 is the cause of the test failure. Jason Lowe, can you review the 003 version of this patch. I have added ignored transitions for the states that were missing the LAUNCHED event.

        Show
        jeagles Jonathan Eagles added a comment - - edited YARN-7044 is the cause of the test failure. Jason Lowe , can you review the 003 version of this patch. I have added ignored transitions for the states that were missing the LAUNCHED event.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 20s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
              trunk Compile Tests
        +1 mvninstall 18m 32s trunk passed
        +1 compile 0m 43s trunk passed
        +1 checkstyle 0m 34s trunk passed
        +1 mvnsite 0m 50s trunk passed
        +1 findbugs 1m 20s trunk passed
        +1 javadoc 0m 28s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 45s the patch passed
        +1 compile 0m 45s the patch passed
        +1 javac 0m 45s the patch passed
        +1 checkstyle 0m 32s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 197 unchanged - 1 fixed = 197 total (was 198)
        +1 mvnsite 0m 46s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 26s the patch passed
        +1 javadoc 0m 25s the patch passed
              Other Tests
        -1 unit 45m 15s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        74m 27s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:14b5c93
        JIRA Issue YARN-2416
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882969/YARN-2416.003.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 091941edb67d 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / b6bfb2f
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        unit https://builds.apache.org/job/PreCommit-YARN-Build/17040/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17040/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/17040/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 20s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.       trunk Compile Tests +1 mvninstall 18m 32s trunk passed +1 compile 0m 43s trunk passed +1 checkstyle 0m 34s trunk passed +1 mvnsite 0m 50s trunk passed +1 findbugs 1m 20s trunk passed +1 javadoc 0m 28s trunk passed       Patch Compile Tests +1 mvninstall 0m 45s the patch passed +1 compile 0m 45s the patch passed +1 javac 0m 45s the patch passed +1 checkstyle 0m 32s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 197 unchanged - 1 fixed = 197 total (was 198) +1 mvnsite 0m 46s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 26s the patch passed +1 javadoc 0m 25s the patch passed       Other Tests -1 unit 45m 15s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 74m 27s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue YARN-2416 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882969/YARN-2416.003.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 091941edb67d 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / b6bfb2f Default Java 1.8.0_144 findbugs v3.1.0-RC1 unit https://builds.apache.org/job/PreCommit-YARN-Build/17040/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17040/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/17040/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        jeagles Jonathan Eagles added a comment -

        Good catch, Jason Lowe. Updated the patch to include these race conditions as well.

        Show
        jeagles Jonathan Eagles added a comment - Good catch, Jason Lowe . Updated the patch to include these race conditions as well.
        Hide
        jlowe Jason Lowe added a comment -

        Thanks for the patch!

        In general patch looks good, but arguably there are some race conditions that are still missed. If we're going to let the AM register before we see the launched event then we could theoretically see the launched event in other states besides RUNNING as well if the AM is very short lived, such as FINISHING, FAILED, etc. The LAUNCHED event needs to be ignored in those as well.

        Otherwise the patch looks good.

        Show
        jlowe Jason Lowe added a comment - Thanks for the patch! In general patch looks good, but arguably there are some race conditions that are still missed. If we're going to let the AM register before we see the launched event then we could theoretically see the launched event in other states besides RUNNING as well if the AM is very short lived, such as FINISHING, FAILED, etc. The LAUNCHED event needs to be ignored in those as well. Otherwise the patch looks good.
        Hide
        jeagles Jonathan Eagles added a comment -

        Fixed the checkstyle warning.
        Test failure is due to https://issues.apache.org/jira/browse/YARN-7044
        Jason Lowe, can you take a look at this approach?

        Show
        jeagles Jonathan Eagles added a comment - Fixed the checkstyle warning. Test failure is due to https://issues.apache.org/jira/browse/YARN-7044 Jason Lowe , can you take a look at this approach?
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 16s Docker mode activated.
              Prechecks
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
              trunk Compile Tests
        +1 mvninstall 16m 17s trunk passed
        +1 compile 0m 37s trunk passed
        +1 checkstyle 0m 31s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 findbugs 1m 3s trunk passed
        +1 javadoc 0m 22s trunk passed
              Patch Compile Tests
        +1 mvninstall 0m 33s the patch passed
        +1 compile 0m 33s the patch passed
        +1 javac 0m 33s the patch passed
        -0 checkstyle 0m 26s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 197 unchanged - 1 fixed = 198 total (was 198)
        +1 mvnsite 0m 36s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 6s the patch passed
        +1 javadoc 0m 20s the patch passed
              Other Tests
        -1 unit 43m 59s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 14s The patch does not generate ASF License warnings.
        68m 49s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:14b5c93
        JIRA Issue YARN-2416
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882651/YARN-2416.001.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 67b41019b168 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 2d105a2
        Default Java 1.8.0_144
        findbugs v3.1.0-RC1
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/16995/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/16995/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/16995/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/16995/console
        Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 16s Docker mode activated.       Prechecks +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.       trunk Compile Tests +1 mvninstall 16m 17s trunk passed +1 compile 0m 37s trunk passed +1 checkstyle 0m 31s trunk passed +1 mvnsite 0m 39s trunk passed +1 findbugs 1m 3s trunk passed +1 javadoc 0m 22s trunk passed       Patch Compile Tests +1 mvninstall 0m 33s the patch passed +1 compile 0m 33s the patch passed +1 javac 0m 33s the patch passed -0 checkstyle 0m 26s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 197 unchanged - 1 fixed = 198 total (was 198) +1 mvnsite 0m 36s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 6s the patch passed +1 javadoc 0m 20s the patch passed       Other Tests -1 unit 43m 59s hadoop-yarn-server-resourcemanager in the patch failed. +1 asflicense 0m 14s The patch does not generate ASF License warnings. 68m 49s Reason Tests Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Subsystem Report/Notes Docker Image:yetus/hadoop:14b5c93 JIRA Issue YARN-2416 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12882651/YARN-2416.001.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 67b41019b168 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 2d105a2 Default Java 1.8.0_144 findbugs v3.1.0-RC1 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/16995/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/16995/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/16995/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/16995/console Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
        Hide
        jeagles Jonathan Eagles added a comment -

        Assumptions for this test. Design is to simply allow REGISTERED before and LAUNCHED events arrive out of order.
        Considerations

        • Launched transition populates the launch am end time, calculates and publishes the AM launch delay metric, updates the diagnostics (overwritten by the REGISTERED event), and starts the AM liveliness monitor. This patch calls the AM launched transition manually from the registered event. This prevents duplication of the code, but does fundamentally separate the code from the Registration transition. Another way to do this would be to calculate the cluster metrics AM Register Delay as 0, and when the LAUNCHED event comes in to not update the diagnostics if we are in the running state.
        Show
        jeagles Jonathan Eagles added a comment - Assumptions for this test. Design is to simply allow REGISTERED before and LAUNCHED events arrive out of order. Considerations Launched transition populates the launch am end time, calculates and publishes the AM launch delay metric, updates the diagnostics (overwritten by the REGISTERED event), and starts the AM liveliness monitor. This patch calls the AM launched transition manually from the registered event. This prevents duplication of the code, but does fundamentally separate the code from the Registration transition. Another way to do this would be to calculate the cluster metrics AM Register Delay as 0, and when the LAUNCHED event comes in to not update the diagnostics if we are in the running state.
        Hide
        jlowe Jason Lowe added a comment -

        Bumping the priority of this since easily leads to infinite job hangs:

        1. AM races ahead of launch event and tries to register.
        2. ApplicationMasterService ends up completing the registration from its point of view and fires event to app state machine
        3. App state machine still thinks it hasn't launched yet, so it throws the stacktrace listed above and remains in the ALLOCATED state.
        4. Eventually launch even arrives and we move to the LAUNCHED state.
        5. We can never move to the RUNNING state since the app and ApplicationMasterService both think we've registered but the app state machine does not.
        6. App tries to unregister. Since ApplicationMasterService thinks it has registered, it tickles the expiration watchdog and sends the event to the app state machine. App is told unregister was false because it hasn't been saved in a terminal state
        7. App state machine barfs and ignores unregister event since app is not RUNNING (and never will be)
        8. Rinse, repeat unregistration forever until app is manually killed. The RM will never expire the app because ApplicationMasterService keeps tickling the watchdog on each unregister attempt.

        Now AM.register and Event.Launched is asynchronous. If need to change it to be synchronous?

        This is not easy to do. There could be a network cut between the RM and the NM or a large GC event just as the NM tries to send the RPC response to the RM, but meanwhile the AM proceeds to launch and ends up registering to the RM just as the NM RPC response arrives. The only way to enforce ordering there is to temporarily reject AM registrations until the state machine has seen the launch event. I'd much prefer to just fix the state machine to expect these two events to race inherently as originally suggested.

        Show
        jlowe Jason Lowe added a comment - Bumping the priority of this since easily leads to infinite job hangs: AM races ahead of launch event and tries to register. ApplicationMasterService ends up completing the registration from its point of view and fires event to app state machine App state machine still thinks it hasn't launched yet, so it throws the stacktrace listed above and remains in the ALLOCATED state. Eventually launch even arrives and we move to the LAUNCHED state. We can never move to the RUNNING state since the app and ApplicationMasterService both think we've registered but the app state machine does not. App tries to unregister. Since ApplicationMasterService thinks it has registered, it tickles the expiration watchdog and sends the event to the app state machine. App is told unregister was false because it hasn't been saved in a terminal state App state machine barfs and ignores unregister event since app is not RUNNING (and never will be) Rinse, repeat unregistration forever until app is manually killed. The RM will never expire the app because ApplicationMasterService keeps tickling the watchdog on each unregister attempt. Now AM.register and Event.Launched is asynchronous. If need to change it to be synchronous? This is not easy to do. There could be a network cut between the RM and the NM or a large GC event just as the NM tries to send the RPC response to the RM, but meanwhile the AM proceeds to launch and ends up registering to the RM just as the NM RPC response arrives. The only way to enforce ordering there is to temporarily reject AM registrations until the state machine has seen the launch event. I'd much prefer to just fix the state machine to expect these two events to race inherently as originally suggested.
        Hide
        Feng Yuan Feng Yuan added a comment -

        Hi Jian Fang,Junping Du:
        Thanks for your find,In my humble opinion:
        Now AM.register and Event.Launched is asynchronous.
        If need to change it to be synchronous?
        Hope suggestions!

        Show
        Feng Yuan Feng Yuan added a comment - Hi Jian Fang , Junping Du : Thanks for your find,In my humble opinion: Now AM.register and Event.Launched is asynchronous. If need to change it to be synchronous? Hope suggestions!
        Hide
        djp Junping Du added a comment -

        Thanks for identifying and reporting the issue, Jian Fang! Add state transition from ALLOCATED to RUNNING. Mind deliver a fix for it?

        Show
        djp Junping Du added a comment - Thanks for identifying and reporting the issue, Jian Fang ! Add state transition from ALLOCATED to RUNNING. Mind deliver a fix for it?
        Hide
        john.jian.fang Jian Fang added a comment -

        Seems for some reason, the RPC call for startContainers(allRequests) in AMLauncher was blocked. More interesting thing is that the second retry to launch the application master failed for the same reason and the responses of the two startContainers() calls for both AM launches came back at the same time as shown in the above log.

        Since the REGISTERED event is a good indication that the AM container was launched successfully, can we add state transition from ALLOCATED to RUNNING or do two state transitions, i.e., from ALLOCATED to LAUNCHED and then from LAUNCHED to RUNNING, in this case?

        Show
        john.jian.fang Jian Fang added a comment - Seems for some reason, the RPC call for startContainers(allRequests) in AMLauncher was blocked. More interesting thing is that the second retry to launch the application master failed for the same reason and the responses of the two startContainers() calls for both AM launches came back at the same time as shown in the above log. Since the REGISTERED event is a good indication that the AM container was launched successfully, can we add state transition from ALLOCATED to RUNNING or do two state transitions, i.e., from ALLOCATED to LAUNCHED and then from LAUNCHED to RUNNING, in this case?
        Hide
        john.jian.fang Jian Fang added a comment -

        Here is the log for the events and state transitions.

        2014-07-05 08:57:41,748 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from NEW to ALLOCATED

        2014-07-05 08:57:41,760 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED

        2014-07-05 08:57:41,833 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_01_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr

        2014-07-05 08:57:42,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING
        2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from RUNNING to KILLED

        2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1404549222428_0001_01_000001 in state: KILLED event:KILL

        2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
        2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
        2014-07-05 08:58:54,395 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1404549222428_0001_000002

        2014-07-05 08:58:54,397 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_02_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr

        2014-07-05 08:58:55,396 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING

        2014-07-05 08:59:05,020 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1404549222428_0001_000002
        2014-07-05 08:59:05,021 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.198.10.201 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1404549222428_0001 APPATTEMPTID=appattempt_1404549222428_0001_000002

        2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_01_000001, NodeId: ip-10-198-22-164.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-22-164.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token

        { kind: ContainerToken, service: 10.198.22.164:9103 }

        , ] for AM appattempt_1404549222428_0001_000001
        2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_02_000001, NodeId: ip-10-198-10-201.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-10-201.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token

        { kind: ContainerToken, service: 10.198.10.201:9103 }

        , ] for AM appattempt_1404549222428_0001_000002

        Show
        john.jian.fang Jian Fang added a comment - Here is the log for the events and state transitions. 2014-07-05 08:57:41,748 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from NEW to ALLOCATED 2014-07-05 08:57:41,760 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED 2014-07-05 08:57:41,833 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_01_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr 2014-07-05 08:57:42,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING 2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from RUNNING to KILLED 2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1404549222428_0001_01_000001 in state: KILLED event:KILL 2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING 2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED 2014-07-05 08:58:54,395 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1404549222428_0001_000002 2014-07-05 08:58:54,397 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_02_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr 2014-07-05 08:58:55,396 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING 2014-07-05 08:59:05,020 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1404549222428_0001_000002 2014-07-05 08:59:05,021 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.198.10.201 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1404549222428_0001 APPATTEMPTID=appattempt_1404549222428_0001_000002 2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_01_000001, NodeId: ip-10-198-22-164.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-22-164.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.198.22.164:9103 } , ] for AM appattempt_1404549222428_0001_000001 2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_02_000001, NodeId: ip-10-198-10-201.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-10-201.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.198.10.201:9103 } , ] for AM appattempt_1404549222428_0001_000002

          People

          • Assignee:
            jeagles Jonathan Eagles
            Reporter:
            john.jian.fang Jian Fang
          • Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development