Hadoop YARN / YARN-3655

FairScheduler: potential livelock due to maxAMShare limitation and container reservation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: fairscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      FairScheduler: potential livelock due to maxAMShare limitation and container reservation.
      When an application reserves a node, no other application can be assigned a new container on that node until the reserving application either allocates a container there or releases its reservation.
      The problem arises when an application calls assignReservedContainer and fails to get a new container because of the maxAMShare limitation. Its reservation stays in place and blocks all other applications from using the nodes it has reserved. If the other running applications cannot finish and release their AM containers because they are blocked by these reservations, the queue's AM resource usage never drops, and a livelock can result.
      The following code in FSAppAttempt#assignContainer can cause this livelock:

          // Check the AM resource usage for the leaf queue
          if (!isAmRunning() && !getUnmanagedAM()) {
            List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
            if (ask.isEmpty() || !getQueue().canRunAppAM(
                ask.get(0).getCapability())) {
              if (LOG.isDebugEnabled()) {
                LOG.debug("Skipping allocation because maxAMShare limit would " +
                    "be exceeded");
              }
              return Resources.none();
            }
          }
      

      To fix this issue, we can unreserve the node when the AM container cannot be allocated on it because of the maxAMShare limitation and the node is reserved by that application.
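
      The proposed behavior can be illustrated with a small standalone simulation. This is a toy model, not the code from the attached patches: the classes Queue, Node and App, the method simulate, and the flag unreserveOnAmLimit are all hypothetical names introduced here, and resources are reduced to a single integer. It only sketches why dropping the reservation on the maxAMShare path lets another application make progress.

```java
import java.util.Arrays;
import java.util.List;

public class ReservationLivelockSketch {

    /** Toy leaf queue: tracks AM usage against a fixed AM limit. */
    static class Queue {
        final int maxAmResource;
        int amUsage;
        Queue(int maxAmResource, int amUsage) {
            this.maxAmResource = maxAmResource;
            this.amUsage = amUsage;
        }
        boolean canRunAppAM(int amDemand) {
            return amUsage + amDemand <= maxAmResource;
        }
    }

    /** Toy node: some free capacity and at most one reservation. */
    static class Node {
        int free;
        App reservedBy; // null when the node is not reserved
        Node(int free) { this.free = free; }
    }

    /** Toy application that keeps asking for containers of size 'demand'. */
    static class App {
        final Queue queue;
        final int demand;
        boolean amRunning;
        int allocated;
        App(Queue queue, int demand, boolean amRunning) {
            this.queue = queue;
            this.demand = demand;
            this.amRunning = amRunning;
        }

        int assignContainer(Node node, boolean unreserveOnAmLimit) {
            // Same shape as the check quoted above: skip when the AM limit would be exceeded.
            if (!amRunning && !queue.canRunAppAM(demand)) {
                // Assumed fix: give up a reservation this app cannot use right now.
                if (unreserveOnAmLimit && node.reservedBy == this) {
                    node.reservedBy = null;
                }
                return 0;
            }
            if (node.free < demand) {
                node.reservedBy = this; // not enough room: reserve the node
                return 0;
            }
            node.free -= demand;
            allocated += demand;
            if (node.reservedBy == this) {
                node.reservedBy = null;
            }
            if (!amRunning) {
                amRunning = true;
                queue.amUsage += demand;
            }
            return demand;
        }
    }

    /** One node heartbeat: a reserved node is offered only to the reserving app. */
    static void nodeUpdate(Node node, List<App> apps, boolean fix) {
        if (node.reservedBy != null) {
            node.reservedBy.assignContainer(node, fix);
            return;
        }
        for (App app : apps) {
            if (app.assignContainer(node, fix) > 0) {
                return;
            }
        }
    }

    /** Returns how much the waiting application was allocated after the heartbeats. */
    static int simulate(boolean fix, int heartbeats) {
        Queue queue = new Queue(2, 2);          // AM usage is already at the limit
        Node node = new Node(1);
        App blocked = new App(queue, 1, false); // AM not started, holds the reservation
        App waiting = new App(queue, 1, true);  // AM running, needs a task container
        node.reservedBy = blocked;
        List<App> apps = Arrays.asList(blocked, waiting);
        for (int i = 0; i < heartbeats; i++) {
            nodeUpdate(node, apps, fix);
        }
        return waiting.allocated;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + simulate(false, 10)); // prints "without fix: 0"
        System.out.println("with fix:    " + simulate(true, 10));  // prints "with fix:    1"
    }
}
```

      In this model the reserving application never passes the AM check, so without the unreserve step every heartbeat on the node is spent on it and the waiting application gets nothing. With the unreserve step, the reservation is dropped on the first heartbeat and the waiting application is allocated on the second.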

      Attachments

        1. YARN-3655.004.patch (27 kB, Zhihai Xu)
        2. YARN-3655.003.patch (28 kB, Zhihai Xu)
        3. YARN-3655.002.patch (16 kB, Zhihai Xu)
        4. YARN-3655.001.patch (9 kB, Zhihai Xu)
        5. YARN-3655.000.patch (1 kB, Zhihai Xu)

          People

            Assignee: Zhihai Xu (zxu)
            Reporter: Zhihai Xu (zxu)
            Votes: 0
            Watchers: 6