Hadoop YARN / YARN-9237

NM should ignore sending finished apps to RM during RM fail-over


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.4, 3.3.0, 3.2.1, 3.1.3
    • Component/s: yarn
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      After an RM failover, I found many occurrences of the following message in the active RM's log file:

      2019-01-24 15:43:58,999 WARN org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get RMApp by appId=application_1542178952162_34746156, just added it to finishedApplications list for cleanup
      .....
      

      I looked back through the RM logs and found that this app had finished hours earlier:

      2019-01-23 21:49:55,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1542178952162_34746156_000001 State change from FINAL_SAVING to FINISHING
      

      The RM prints "Cannot get RMApp by appId" for the following reason:
      1. The RM fails over.
      2. Each NM reports all of its running apps to the RM in its register request.
      3. The list of running apps comes from NMContext, and some of those apps may have already finished.
      4. In my cluster, yarn.log-aggregation-enable=false and yarn.nodemanager.log.retain-seconds=86400 (1 day), so an app stays in NMContext for 24 hours after it has finished.
      5. My YARN cluster runs 50k apps per day on 7k nodes, so the NMs report many finished apps to the RM.
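
      The steps above suggest that the NM could skip finished apps when it builds the app list for its register request. The following is a minimal sketch of that idea, not the attached patch: `NmRegisterSketch`, `AppState`, and `appsToReport` are hypothetical stand-ins for the NM context and its application states, and the second app ID is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NmRegisterSketch {

    // Hypothetical stand-in for the NM-side application state; not the real YARN type.
    enum AppState { RUNNING, FINISHED }

    // Returns the app IDs the NM would report when it (re-)registers with the RM.
    // Apps that are still tracked locally but have already finished are skipped,
    // so the RM is not asked about apps it no longer knows.
    static List<String> appsToReport(Map<String, AppState> nmContextApps) {
        List<String> running = new ArrayList<>();
        for (Map.Entry<String, AppState> entry : nmContextApps.entrySet()) {
            if (entry.getValue() != AppState.FINISHED) {
                running.add(entry.getKey());
            }
        }
        return running;
    }

    public static void main(String[] args) {
        Map<String, AppState> apps = new LinkedHashMap<>();
        // Finished app that is still retained in the local context (ID from the log above).
        apps.put("application_1542178952162_34746156", AppState.FINISHED);
        // Made-up ID for an app that is still running.
        apps.put("application_1542178952162_34750001", AppState.RUNNING);
        System.out.println(appsToReport(apps));
    }
}
```

      With this filter, only the still-running app would appear in the register request, so the retained finished app would not trigger the warning on the RM side.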

      Attachments

        1. YARN-9237.002.patch
          1 kB
          Jiandan Yang
        2. YARN-9237.001.patch
          1 kB
          Jiandan Yang


          People

            Assignee: Jiandan Yang (yangjiandan)
            Reporter: Jiandan Yang (yangjiandan)
            Votes: 0
            Watchers: 5
