Hadoop YARN / YARN-933

Potential InvalidStateTransitonException: Invalid event: LAUNCHED at FINAL_SAVING

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.5-alpha
    • Fix Version/s: 2.7.0
    • Component/s: resourcemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      AM max retries is configured as 3 on both the client and the RM side.

      Step 1: Install a cluster with NMs on 2 machines.
      Step 2: Set up the network so that pinging the NM1 machine from the RM machine succeeds by IP address but fails by hostname.
      Step 3: Execute a job.
      Step 4: After the AM [AppAttempt_1] has been allocated to the NM1 machine, the connection is lost.

      Observation :
      ==========
      After AppAttempt_1 moved to the FAILED state, the container release for AppAttempt_1 and the application removal succeeded. A new AppAttempt_2 was spawned.

      1. The launch of AppAttempt_1 is then retried.
      2. The RM again tries to launch AppAttempt_1, which fails with InvalidStateTransitonException.
      3. The client exited once AppAttempt_1 finished, although the job was actually still running: the number of app attempts is configured as 3, and the remaining attempts were spawned and running.

      RMLogs:
      ======
      2013-07-17 16:22:51,013 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_000001 State change from SCHEDULED to ALLOCATED
      2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); maxRetries=45
      2013-07-17 16:36:07,091 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1373952096466_0056_01_000001 Timed out after 600 secs
      2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1373952096466_0056_01_000001 Container Transitioned from ACQUIRED to EXPIRED

      2013-07-17 16:36:07,093 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering appattempt_1373952096466_0056_000002

      2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1373952096466_0056_000001 is done. finalState=FAILED
      2013-07-17 16:36:07,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1373952096466_0056 user: Rex leaf-queue of parent: root #applications: 35

      2013-07-17 16:36:07,132 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Submission: appattempt_1373952096466_0056_000002,
      2013-07-17 16:36:07,138 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1373952096466_0056_000002 State change from SUBMITTED to SCHEDULED

      2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); maxRetries=45
      2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); maxRetries=45
      2013-07-17 16:38:56,207 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1373952096466_0056_000001. Got exception: java.lang.reflect.UndeclaredThrowableException
      2013-07-17 16:38:56,207 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: LAUNCH_FAILED at FAILED
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
      at java.lang.Thread.run(Thread.java:662)
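
The failure pattern in the RM log above (a launcher result arriving after the attempt has already reached FAILED) can be illustrated with a toy state machine. This is only a minimal sketch under stated assumptions: the class `AttemptSketch`, its reduced state and event sets, and the `ignoreLateLauncherEvents` flag are all hypothetical and do not reproduce the real transition table of `RMAppAttemptImpl` or the change made in the attached patches.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a small, hypothetical subset of states and events.
// It is NOT the transition table used by RMAppAttemptImpl.
class AttemptSketch {
    enum State { SCHEDULED, ALLOCATED, LAUNCHED, FAILED }
    enum Event { CONTAINER_ALLOCATED, LAUNCHED, LAUNCH_FAILED, EXPIRE }

    // state -> (event -> next state); a missing entry means "invalid event".
    private static final Map<State, Map<Event, State>> TABLE =
            new EnumMap<State, Map<Event, State>>(State.class);
    static {
        for (State s : State.values()) {
            TABLE.put(s, new EnumMap<Event, State>(Event.class));
        }
        TABLE.get(State.SCHEDULED).put(Event.CONTAINER_ALLOCATED, State.ALLOCATED);
        TABLE.get(State.ALLOCATED).put(Event.LAUNCHED, State.LAUNCHED);
        TABLE.get(State.ALLOCATED).put(Event.LAUNCH_FAILED, State.FAILED);
        TABLE.get(State.ALLOCATED).put(Event.EXPIRE, State.FAILED);
    }

    // Events produced by the (slow) launcher that may arrive after expiry.
    private static final Set<Event> LAUNCHER_EVENTS =
            EnumSet.of(Event.LAUNCHED, Event.LAUNCH_FAILED);

    private State state = State.SCHEDULED;
    private final boolean ignoreLateLauncherEvents; // hypothetical switch

    AttemptSketch(boolean ignoreLateLauncherEvents) {
        this.ignoreLateLauncherEvents = ignoreLateLauncherEvents;
    }

    State getState() {
        return state;
    }

    void handle(Event event) {
        State next = TABLE.get(state).get(event);
        if (next == null) {
            // Assumed mitigation: drop stale launcher results once FAILED.
            if (ignoreLateLauncherEvents && state == State.FAILED
                    && LAUNCHER_EVENTS.contains(event)) {
                return;
            }
            throw new IllegalStateException(
                    "Invalid event: " + event + " at " + state);
        }
        state = next;
    }
}
```

In this sketch, feeding CONTAINER_ALLOCATED, then EXPIRE, then LAUNCH_FAILED to a strict instance throws on the last event, mirroring the ordering in the log: the container expires at 16:36:07 and the launcher only gives up at 16:38:56. With the hypothetical flag enabled, the stale event is dropped and the attempt stays in FAILED.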

      Client Logs
      ========
      Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=host-10-18-40-15/10.18.40.59:8020]
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:573)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
      2013-07-17 16:37:05,987 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:Rex (auth:SIMPLE) cause:org.apache.hadoop.net.ConnectTimeoutException: Call From HOST-10-18-91-55/10.18.40.57 to host-10-18-40-15:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=host-10-18-40-15/10.18.40.59:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout

        Attachments

        1. 0001-YARN-933.patch
          4 kB
          Rohith Sharma K S
        2. 0001-YARN-933.patch
          4 kB
          Rohith Sharma K S
        3. 0004-YARN-933.patch
          4 kB
          Rohith Sharma K S
        4. YARN-933.3.patch
          4 kB
          Jian He
        5. YARN-933.patch
          3 kB
          Rohith Sharma K S

            People

            • Assignee:
              rohithsharma Rohith Sharma K S
              Reporter:
              andreina J.Andreina
