YARN-6959: RM may allocate wrong AM Container for new attempt

Details

    • Reviewed
    • ResourceManager will now record ResourceRequests from different attempts into different objects.
    • Patch, Important

    Description

      Issue Summary:
      A ResourceRequest from a previous attempt may be recorded into the current attempt's ResourceRequests. These mis-recorded ResourceRequests may corrupt the AM Container request and allocation for the current attempt.

      Issue Pipeline:

      // The precondition check for the incoming attempt id runs here.
      ApplicationMasterService.allocate() ->
      
      scheduler.allocate(attemptId, ask, ...) ->
      
      // The earlier precondition check on the attempt id may be outdated by this point,
      // i.e. the returned currentAttempt may not be the attempt that attemptId refers to,
      // for example when attemptId belongs to the previous attempt.
      currentAttempt = scheduler.getApplicationAttempt(attemptId) ->
      
      // A ResourceRequest from the previous attempt may be recorded into the current attempt's ResourceRequests.
      currentAttempt.updateResourceRequests(ask) ->
      
      // RM may allocate the wrong AM Container for the current attempt, because its ResourceRequests
      // may come from the previous attempt (any ResourceRequest the previous AM asked for),
      // and there is no logic that matches the original AM Container ResourceRequest against
      // the amContainerAllocation returned below.
      AMContainerAllocatedTransition.transition(...) ->
      amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
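
The interleaving described in this pipeline can be replayed with a small, self-contained model. This is only a sketch under stated assumptions: `ToyAttempt`, `ToyScheduler` and `StaleAskDemo` are hypothetical stand-ins invented for illustration, not the real YARN classes, and the lookup is assumed to return the application's current attempt regardless of the attempt id it is given, as the pipeline comments describe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one application attempt; not the real YARN class. */
final class ToyAttempt {
    final int attemptId;
    final List<String> resourceRequests = new ArrayList<>();

    ToyAttempt(int attemptId) {
        this.attemptId = attemptId;
    }
}

/** Hypothetical scheduler that only tracks the current attempt of each application. */
final class ToyScheduler {
    private final Map<String, ToyAttempt> currentAttemptByApp = new HashMap<>();

    /** Starting a new attempt replaces the previous one as the current attempt. */
    void startAttempt(String appId, int attemptId) {
        currentAttemptByApp.put(appId, new ToyAttempt(attemptId));
    }

    /** Assumed behavior: the attempt id is ignored and the current attempt is returned. */
    ToyAttempt getApplicationAttempt(String appId, int attemptId) {
        return currentAttemptByApp.get(appId);
    }

    /** Records the ask on whatever attempt the lookup returned. */
    ToyAttempt allocate(String appId, int attemptId, String ask) {
        ToyAttempt target = getApplicationAttempt(appId, attemptId);
        target.resourceRequests.add(ask);
        return target;
    }
}

final class StaleAskDemo {
    /** Replays the race: attempt 1 is replaced by attempt 2 before its late ask is recorded. */
    static ToyAttempt replay() {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.startAttempt("app_1", 1);
        // A precondition check for attempt 1 would still pass at this point.
        scheduler.startAttempt("app_1", 2);
        scheduler.allocate("app_1", 2, "am-container-request");
        // Late ask from attempt 1 arrives after attempt 2 became current.
        return scheduler.allocate("app_1", 1, "task-container-from-attempt-1");
    }

    public static void main(String[] args) {
        ToyAttempt polluted = replay();
        System.out.println("attempt " + polluted.attemptId + " holds " + polluted.resourceRequests);
    }
}
```

In this model the late ask from attempt 1 ends up in the request list of attempt 2, next to the AM container request, which is the confusion the issue summary describes.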
      

      Patch Correctness:
      After this patch, RM always records ResourceRequests from different attempts into different SchedulerApplicationAttempt.AppSchedulingInfo objects.
      So even if RM still records ResourceRequests from an old attempt at some point, they land in the old AppSchedulingInfo object and cannot affect the current attempt's resource requests or allocation.
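
The isolation argument can be illustrated with a minimal sketch. It assumes one request store per attempt id; `ToySchedulingInfo` and `PerAttemptRegistry` are hypothetical names that only stand in for AppSchedulingInfo and its owner, and the sketch does not claim to reproduce how the patch itself is written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-attempt request store, standing in for AppSchedulingInfo. */
final class ToySchedulingInfo {
    final List<String> requests = new ArrayList<>();
}

/** Hypothetical registry that keeps one ToySchedulingInfo per attempt id. */
final class PerAttemptRegistry {
    private final Map<Integer, ToySchedulingInfo> infoByAttempt = new HashMap<>();
    private int currentAttemptId = -1;

    /** Creates a fresh store for the new attempt and marks it as current. */
    void startAttempt(int attemptId) {
        infoByAttempt.put(attemptId, new ToySchedulingInfo());
        currentAttemptId = attemptId;
    }

    /** Records the ask on the store that belongs to the given attempt id only. */
    void record(int attemptId, String ask) {
        ToySchedulingInfo info = infoByAttempt.get(attemptId);
        if (info != null) {
            info.requests.add(ask);
        }
    }

    /** Requests visible to allocation for the current attempt. */
    List<String> currentRequests() {
        return infoByAttempt.get(currentAttemptId).requests;
    }

    /** Requests recorded for one specific attempt. */
    List<String> requestsOf(int attemptId) {
        return infoByAttempt.get(attemptId).requests;
    }
}
```

With this layout, a late `record(1, ...)` call after `startAttempt(2)` only changes the store of attempt 1, so `currentRequests()` still contains nothing but what attempt 2 asked for.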

      Concerns:
      The name of the getApplicationAttempt function in AbstractYarnScheduler is confusing; it would be better to rename it to getCurrentApplicationAttempt. We should also check whether any other bugs are related to getApplicationAttempt.
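
One way to picture the naming concern is to separate the two meanings into two methods. The sketch below is hypothetical: `AttemptRef`, `AttemptLookup` and `findApplicationAttempt` are names made up for illustration, and only `getCurrentApplicationAttempt` is the rename proposed above. It assumes a single application for brevity.

```java
import java.util.Optional;

/** Hypothetical handle for an attempt; not a real YARN type. */
final class AttemptRef {
    final int attemptId;

    AttemptRef(int attemptId) {
        this.attemptId = attemptId;
    }
}

/** Hypothetical lookup for a single application with explicit method names. */
final class AttemptLookup {
    private AttemptRef current;

    void setCurrent(AttemptRef attempt) {
        this.current = attempt;
    }

    /** The name states what is returned: the current attempt, whatever its id. */
    AttemptRef getCurrentApplicationAttempt() {
        return current;
    }

    /** Strict variant: empty unless the requested id is the current attempt's id. */
    Optional<AttemptRef> findApplicationAttempt(int attemptId) {
        if (current != null && current.attemptId == attemptId) {
            return Optional.of(current);
        }
        return Optional.empty();
    }
}
```

A caller that holds the id of a previous attempt then gets an empty result from the strict variant instead of silently receiving the current attempt.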

      Attachments

        1. YARN-6959.005.patch
          11 kB
          Yuqi Wang
        2. YARN-6959.yarn_rm.log.zip
          4.04 MB
          Yuqi Wang
        3. YARN-6959.yarn_nm.log.zip
          1.74 MB
          Yuqi Wang
        4. YARN-6959-branch-2.8.001.patch
          11 kB
          Yuqi Wang
        5. YARN-6959-branch-2.7.005.patch
          11 kB
          Yuqi Wang
        6. YARN-6959-branch-2.7.006.patch
          11 kB
          Yuqi Wang
        7. YARN-6959-branch-2.8.002.patch
          11 kB
          Yuqi Wang

          People

            yqwang Yuqi Wang