  1. Hadoop YARN
  2. YARN-5139 [Umbrella] Move YARN scheduler towards global scheduler
  3. YARN-10352

Skip schedule on not heartbeated nodes in Multi Node Placement


Details

    Description

      When node recovery is enabled, stopping an NM does not unregister it from the RM. The RM's active node list therefore still contains the stopped node until the NM liveliness monitor expires it after the configured timeout (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 minutes, Multi Node Placement keeps assigning containers to that node. Multi Node Placement needs to exclude nodes that have not heartbeated within the configured heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms = 1000 ms), as the asynchronous Capacity Scheduler threads already do (CapacityScheduler#shouldSkipNodeSchedule).
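
      The proposed exclusion can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not reproduce the real CapacityScheduler#shouldSkipNodeSchedule; the class StaleNodeFilter, its methods, and the hard-coded 1000 ms constant are hypothetical names chosen for illustration. It assumes a node is skipped once it has been silent for longer than one heartbeat interval, as the description suggests; the actual threshold used by the scheduler may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Hadoop YARN.
class StaleNodeFilter {
    // Assumed stand-in for yarn.resourcemanager.nodemanagers.heartbeat-interval-ms.
    static final long HEARTBEAT_INTERVAL_MS = 1000L;

    // True when the node has been silent for longer than one heartbeat interval.
    static boolean shouldSkip(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > HEARTBEAT_INTERVAL_MS;
    }

    // Returns the nodes that are still eligible for placement, in input order.
    static List<String> candidates(Map<String, Long> lastHeartbeatByNode, long nowMs) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatByNode.entrySet()) {
            if (!shouldSkip(entry.getValue(), nowMs)) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        long now = 60_000L;
        Map<String, Long> nodes = new LinkedHashMap<>();
        nodes.put("worker0", now - 30_000L); // stopped NM, last heartbeat 30 s ago
        nodes.put("worker1", now - 500L);    // running NM, heartbeated 500 ms ago
        System.out.println(candidates(nodes, now)); // prints [worker1]
    }
}
```

      With such a filter in front of the placement step, the stopped worker0 from the repro below would be dropped from the candidate set long before the 10 minute liveliness expiry removes it from the active nodes.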

      Repro:

      1. Enable Multi Node Placement (yarn.scheduler.capacity.multi-node-placement-enabled) and node recovery (yarn.nodemanager.recovery.enabled).

      2. Have only one NM running, say worker0.

      3. Stop worker0 and start another NM, say worker1.

      4. Submit a sleep job. The containers will time out because they are assigned to the stopped NM worker0.

      Attachments

        1. YARN-10352.009.patch
          29 kB
          Qi Zhu
        2. YARN-10352-001.patch
          11 kB
          Prabhu Joseph
        3. YARN-10352-002.patch
          15 kB
          Prabhu Joseph
        4. YARN-10352-003.patch
          15 kB
          Prabhu Joseph
        5. YARN-10352-004.patch
          17 kB
          Prabhu Joseph
        6. YARN-10352-005.patch
          18 kB
          Prabhu Joseph
        7. YARN-10352-006.patch
          18 kB
          Prabhu Joseph
        8. YARN-10352-007.patch
          22 kB
          Prabhu Joseph
        9. YARN-10352-008.patch
          23 kB
          Prabhu Joseph
        10. YARN-10352-010.patch
          29 kB
          Qi Zhu

          People

            prabhujoseph Prabhu Joseph
            Votes: 0
            Watchers: 7
