  1. Hadoop Common
  2. HADOOP-12317

Applications fail on NM restart on some Linux distros because NM container recovery declares the AM container as LOST


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      On a Debian machine, we have seen NodeManager recovery of containers fail because the signal syntax for a process group may not work. Errors occur when checking whether the process is alive during container recovery, which causes the container to be declared LOST (exit code 154) on a NodeManager restart.

      The application then fails with the error below, and the attempt is not retried:

      Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
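
      The process-group liveness check mentioned in the description can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the class and method names (ProcessGroupCheck, aliveCheckCommand, isAlive) are hypothetical, and it assumes the check is done with `kill -0` against a negative PID. Signal 0 only tests whether the target exists, a negative PID addresses a whole process group, and `--` marks the end of options so that the negative PID is not parsed as a signal option. Running the command through the bash builtin is one way to avoid depending on how a given distribution's external kill binary parses its arguments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class ProcessGroupCheck {

    // Builds the command that tests whether a process (or process group) exists.
    // Hypothetical helper for illustration only.
    static List<String> aliveCheckCommand(String pid, boolean processGroup) {
        // Reject anything that is not a plain number, since the pid is
        // interpolated into a shell command line below.
        if (!pid.matches("\\d+")) {
            throw new IllegalArgumentException("pid must be numeric: " + pid);
        }
        if (processGroup) {
            // "--" ends option parsing, so "-<pid>" is read as a process
            // group id and not as a signal name or number.
            return Arrays.asList("bash", "-c", "kill -0 -- -" + pid);
        }
        // Single process: no negative argument, so no ambiguity.
        return Arrays.asList("kill", "-0", pid);
    }

    // Runs the check and reports true when the command exits with status 0.
    static boolean isAlive(String pid, boolean processGroup)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(aliveCheckCommand(pid, processGroup))
                .redirectErrorStream(true)
                .start();
        // Drain any diagnostic output so the child never blocks on a full pipe.
        try (InputStream in = p.getInputStream()) {
            while (in.read() != -1) {
                // discard
            }
        }
        return p.waitFor() == 0;
    }
}
```

      A caller that treats a non-zero exit status as "process gone" will report a live container as lost whenever the command itself is rejected, which matches the symptom described above.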
      

      Attachments

        1. YARN-4046.002.patch
          3 kB
          Anubhav Dhoot
        2. YARN-4046.002.patch
          3 kB
          Anubhav Dhoot
        3. YARN-4096.001.patch
          3 kB
          Anubhav Dhoot


          People

            Assignee: Anubhav Dhoot
            Reporter: Anubhav Dhoot