Hadoop YARN / YARN-5063

Fail to launch AM continuously on a lost NM

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: resourcemanager

    Description

      If an NM node shuts down, the RM does not mark it as LOST until the liveness monitor detects that the node has timed out. Until then, the RM may keep allocating the AM on that NM.

      We hit this case in our cluster: the RM repeatedly allocated the same AM on an NM that was already down but had not yet been marked LOST, and AMLauncher failed every time because it could not connect to that NM. To solve the problem, we could add the NM to the AM blacklist when the RM fails to launch the AM on it.
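
      The proposed fix can be illustrated with a minimal, self-contained sketch. It is not YARN code: the class `AmLaunchBlacklistSketch`, its methods, and the node names are all hypothetical, and it only assumes that a failed launch attempt is reported back before the next AM placement decision is made.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the idea in this issue; none of these names are YARN classes.
class AmLaunchBlacklistSketch {
    // Nodes on which an AM launch has already failed for this application.
    private final Set<String> amBlacklist = new LinkedHashSet<>();

    // Called when the launcher cannot reach the NM (for example, connection refused).
    void onAmLaunchFailed(String nodeId) {
        amBlacklist.add(nodeId);
    }

    // Picks the first candidate node that is not blacklisted for AM placement.
    Optional<String> pickNodeForAm(List<String> candidates) {
        return candidates.stream()
                .filter(node -> !amBlacklist.contains(node))
                .findFirst();
    }

    public static void main(String[] args) {
        AmLaunchBlacklistSketch sketch = new AmLaunchBlacklistSketch();
        List<String> nodes = List.of("nm1", "nm2");

        // nm1 is down but not yet marked LOST, so it is still offered as a candidate.
        String first = sketch.pickNodeForAm(nodes).orElseThrow();
        sketch.onAmLaunchFailed(first);

        // The retry skips nm1 instead of failing on it again.
        System.out.println(sketch.pickNodeForAm(nodes).orElseThrow());
    }
}
```

      With this approach the retry no longer depends on the liveness monitor timeout: the first failed launch is enough to steer the next AM attempt to a different node.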

            People

              Assignee: Jun Gong (hex108)
              Reporter: Jun Gong (hex108)
              Votes: 1
              Watchers: 8
