  1. Hadoop YARN
  2. YARN-7790

Improve Capacity Scheduler Async Scheduling to better handle node failures

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0, 3.0.1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      This is not a new issue but async scheduling makes it worse:

      In sync scheduling, when an AM container is allocated to a node, the scheduler can assume the node has just heartbeated to the RM, and the AM launcher then connects to the NM to launch the container. It is still possible for the NM to crash right after the heartbeat, which causes the AM to hang for a while, but this is relatively rare.

      In the async scheduling world, multiple AM containers can be placed on a problematic NM, which can easily cause applications to hang. After discussion with Sunil Govindan and Jian He, we need one fix:

      When async scheduling is enabled:

      • Skip any node that has missed X node heartbeats.
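
      The skip rule above can be sketched as follows. This is a minimal illustration, not the code from the attached patch: the class StaleNodeFilter, its methods, and the choice to compare elapsed time against X heartbeat intervals are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it is not taken from the YARN-7790 patch.
final class StaleNodeFilter {
    private final long heartbeatIntervalMs;   // expected NM heartbeat interval
    private final int maxMissedHeartbeats;    // the "X" in the description

    StaleNodeFilter(long heartbeatIntervalMs, int maxMissedHeartbeats) {
        if (heartbeatIntervalMs <= 0 || maxMissedHeartbeats <= 0) {
            throw new IllegalArgumentException("interval and threshold must be positive");
        }
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    // Returns true when the node has been silent for longer than X heartbeat intervals.
    boolean shouldSkip(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    // Returns the node ids that are still eligible for allocation, in iteration order.
    List<String> schedulable(Map<String, Long> lastHeartbeatByNode, long nowMs) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatByNode.entrySet()) {
            if (!shouldSkip(entry.getValue(), nowMs)) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }
}
```

      An async scheduling thread would consult such a check before trying to allocate on a node, so a node that stopped heartbeating receives no further AM containers until it reports in again.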

      In addition, it is better to reduce the wait time by tuning the following configs, so that a container being launched at an NM with connectivity issues fails earlier.

      RetryPolicy retryPolicy =
          createRetryPolicy(conf,
            YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
            YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
            YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
            YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
      

      The second part is not covered by the patch.
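
      To get a feel for how the two settings interact, the sketch below estimates how many connection attempts fit into the wait budget. It assumes the policy retries with a fixed sleep until a maximum total wait is reached. The class NmConnectRetryBudget is hypothetical, and the numbers in the usage note are illustrative values, not recommendations.

```java
// Hypothetical helper for illustration only; it does not call any Hadoop API.
final class NmConnectRetryBudget {
    private NmConnectRetryBudget() {
    }

    // Number of fixed-interval retries that fit into the maximum wait time.
    static long maxRetries(long maxWaitMs, long retryIntervalMs) {
        if (maxWaitMs < 0 || retryIntervalMs <= 0) {
            throw new IllegalArgumentException("maxWaitMs must be >= 0 and retryIntervalMs > 0");
        }
        return maxWaitMs / retryIntervalMs;
    }
}
```

      For example, a 180000 ms maximum wait with a 10000 ms interval allows 18 retries, while 15000 ms with a 5000 ms interval allows only 3, so a launch against an unreachable NM fails after roughly 15 seconds instead of 3 minutes.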

        Attachments

        1. YARN-7790.003.patch
          24 kB
          Wangda Tan
        2. YARN-7790.002.patch
          23 kB
          Wangda Tan
        3. YARN-7790.001.patch
          24 kB
          Wangda Tan

            People

            • Assignee:
              leftnoteasy Wangda Tan
              Reporter:
              ssathish@hortonworks.com Sumana Sathish