
Application hangs when it fails to launch AM container

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.0
    • Component/s: resourcemanager
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The application hangs, with no timeout and no retry, after DNS or the network goes down.
      This happens when DNS or the network fails on the node hosting the AM container right after that container is allocated for the AM.
      The application attempt is in state RMAppAttemptState.SCHEDULED when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an IllegalArgumentException (caused by the DNS error) is thrown while handling that event, the attempt stays in RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state:
      RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
      The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application remains stuck in RMAppAttemptState.SCHEDULED.
      The only way to get the application out of this state is to send an RMAppAttemptEventType.KILL event, which is generated only when the application is killed manually from the job client via forceKillApplication.

      To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED.
      Add the following code to the StateMachineFactory:

      .addTransition(RMAppAttemptState.SCHEDULED,
          RMAppAttemptState.FINAL_SAVING,
          RMAppAttemptEventType.CONTAINER_FINISHED,
          new FinalSavingTransition(
              new AMContainerCrashedBeforeRunningTransition(),
              RMAppAttemptState.FAILED))
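
      The hang and the proposed fix can be illustrated with a toy transition table. This is only a minimal sketch under stated assumptions, not the real Hadoop code: AttemptStateMachineSketch, its enums, the STATE_SAVED event, and the withoutFix/withFix helpers are hypothetical names, KILL handling is simplified, and an event with no registered transition is silently ignored here to model the attempt staying in its current state.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model of an attempt state machine. All names are hypothetical;
// this does not use or reproduce the real Hadoop classes.
public class AttemptStateMachineSketch {
    public enum State { SCHEDULED, ALLOCATED, FINAL_SAVING, FAILED, KILLED }
    // STATE_SAVED is an assumed stand-in for "the final state was persisted".
    public enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL, STATE_SAVED }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    public AttemptStateMachineSketch addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
        return this;
    }

    // An event without a registered transition leaves the state unchanged.
    public State handle(Event event) {
        Map<Event, State> row = table.get(current);
        if (row != null && row.containsKey(event)) {
            current = row.get(event);
        }
        return current;
    }

    // Before the fix: SCHEDULED reacts only to CONTAINER_ALLOCATED and KILL.
    public static AttemptStateMachineSketch withoutFix() {
        return new AttemptStateMachineSketch()
            .addTransition(State.SCHEDULED, State.ALLOCATED, Event.CONTAINER_ALLOCATED)
            .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
    }

    // After the fix: CONTAINER_FINISHED in SCHEDULED leads to FINAL_SAVING,
    // and the attempt ends up FAILED once the state has been saved.
    public static AttemptStateMachineSketch withFix() {
        return withoutFix()
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED)
            .addTransition(State.FINAL_SAVING, State.FAILED, Event.STATE_SAVED);
    }

    public static void main(String[] args) {
        AttemptStateMachineSketch before = withoutFix();
        // The event is dropped, so the attempt stays SCHEDULED.
        System.out.println("without fix: " + before.handle(Event.CONTAINER_FINISHED));

        AttemptStateMachineSketch after = withFix();
        System.out.println("with fix: " + after.handle(Event.CONTAINER_FINISHED));
        System.out.println("after save: " + after.handle(Event.STATE_SAVED));
    }
}
```

      In this sketch, the table without the extra entry leaves the attempt in SCHEDULED after CONTAINER_FINISHED, and only KILL moves it out. With the extra entry, the same event moves the attempt to FINAL_SAVING and then to FAILED.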

      Attachments

        1. YARN-2359.002.patch
          4 kB
          Zhihai Xu
        2. YARN-2359.001.patch
          4 kB
          Zhihai Xu
        3. YARN-2359.000.patch
          1 kB
          Zhihai Xu

          People

            Assignee: Zhihai Xu (zxu)
            Reporter: Zhihai Xu (zxu)