  1. Hadoop YARN
  2. YARN-8575

Avoid committing allocation proposal to unavailable nodes in async scheduling

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: capacityscheduler
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We recently found the following new error:

      ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: node to unreserve doesn't exist, nodeid: host1:45454
      

      Steps to reproduce:
      (1) Create a reserve proposal for app1 on node1.
      (2) node1 is successfully decommissioned and removed from the node tracker.
      (3) Try to commit this outdated reserve proposal; it is accepted and applied.
      This error can occur after some NMs are decommissioned. The application that prints the error log will always hold a reserved container on a non-existent (decommissioned) NM, and its pending request will never be satisfied.
      To solve this problem, the scheduler should check the node state in FiCaSchedulerApp#accept to avoid committing outdated proposals to unusable nodes.
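
      The race and the proposed check can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the real YARN classes: NodeTracker, ReserveProposal, acceptWithoutNodeCheck and accept are hypothetical stand-ins, and "node state" is reduced here to "the node is still present in the tracker".

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class StaleProposalSketch {

    /** Hypothetical stand-in for the scheduler's node tracker. */
    static final class NodeTracker {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();

        void addNode(String nodeId) { nodes.add(nodeId); }

        void removeNode(String nodeId) { nodes.remove(nodeId); }

        boolean isTracked(String nodeId) { return nodes.contains(nodeId); }
    }

    /** Hypothetical reserve proposal produced by an async scheduling thread. */
    static final class ReserveProposal {
        final String appId;
        final String nodeId;

        ReserveProposal(String appId, String nodeId) {
            this.appId = appId;
            this.nodeId = nodeId;
        }
    }

    /** Models the reported behavior: the proposal is accepted without looking at the node. */
    static boolean acceptWithoutNodeCheck(ReserveProposal proposal, NodeTracker tracker) {
        return true;
    }

    /** Models the suggested fix: reject the proposal if its target node is no longer tracked. */
    static boolean accept(ReserveProposal proposal, NodeTracker tracker) {
        return tracker.isTracked(proposal.nodeId);
    }

    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        tracker.addNode("host1:45454");

        // (1) A reserve proposal is created for app1 on the node.
        ReserveProposal proposal = new ReserveProposal("app1", "host1:45454");

        // (2) The node is decommissioned and removed before the commit happens.
        tracker.removeNode("host1:45454");

        // (3) The outdated proposal reaches the commit phase.
        System.out.println("without check: " + acceptWithoutNodeCheck(proposal, tracker)); // prints "without check: true"
        System.out.println("with check: " + accept(proposal, tracker));                    // prints "with check: false"
    }
}
```

      In this sketch, rejecting the proposal at accept time means no reservation is recorded against the removed node, so the application's request stays pending and can be scheduled on another node later.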

        Attachments

        1. YARN-8575.002.patch
          13 kB
          Tao Yang
        2. YARN-8575.001.patch
          13 kB
          Tao Yang

              People

              • Assignee:
                Tao Yang
                Reporter:
                Tao Yang
              • Votes:
                0
                Watchers:
                4
