Hadoop YARN / YARN-9413

Queue resource leak after app fail for CapacityScheduler


    Details

    • Hadoop Flags:
      Reviewed

      Description

      To reproduce this problem:

      1. Submit an app that is configured to keep containers across app attempts and whose max attempt count is 1 (am-max-attempts=1), so the app fails as soon as its first AM finishes.
      2. The app starts with 2 containers running on node NM1.
      3. Fail the AM with exit status PREEMPTED, which should not count towards the max attempt retry; since am-max-attempts=1, the app still fails immediately.
      4. The queue's used resource leaks after the app fails.
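      The accounting behind these steps can be sketched with a minimal, self-contained simulation (class, field, and method names here are hypothetical and only mimic the queue usage bookkeeping; this is not actual YARN code):

```java
import java.util.ArrayList;
import java.util.List;

public class QueueLeakDemo {
    /** Tracks a queue's used memory, like a CapacityScheduler queue's used resource. */
    static class Queue {
        int usedMemoryMB = 0;
        void allocate(int mb) { usedMemoryMB += mb; }
        void release(int mb)  { usedMemoryMB -= mb; }
    }

    /** Exit statuses that do not count towards am-max-attempts, per the description. */
    static boolean countsTowardsMaxAttemptRetry(String exitStatus) {
        switch (exitStatus) {
            case "PREEMPTED": case "ABORTED":
            case "DISKS_FAILED": case "KILLED_BY_RESOURCEMANAGER":
                return false;
            default:
                return true;
        }
    }

    /**
     * Simulates the buggy cleanup path: when keepContainers is true, the
     * scheduler skips releasing the app's remaining containers, even though
     * the app has already failed for good. Returns the leaked memory.
     */
    static int simulateAppFailure(String amExitStatus, int maxAttempts) {
        Queue queue = new Queue();
        List<Integer> containers = new ArrayList<>();
        for (int mb : new int[] {1024, 2048, 2048}) { // AM + 2 containers on NM1
            queue.allocate(mb);
            containers.add(mb);
        }
        queue.release(containers.remove(0));          // the AM container exits
        boolean keepContainers = !countsTowardsMaxAttemptRetry(amExitStatus);
        boolean appFails = (maxAttempts == 1);        // attempt 1 was the only attempt
        if (appFails && !keepContainers) {
            for (int mb : containers) queue.release(mb); // normal cleanup path
            containers.clear();
        }
        // With keepContainers == true the cleanup above is skipped: the leak.
        return queue.usedMemoryMB;
    }

    public static void main(String[] args) {
        System.out.println("MB leaked after PREEMPTED AM with am-max-attempts=1: "
                + simulateAppFailure("PREEMPTED", 1));
    }
}
```

      Running the simulation with a PREEMPTED AM and am-max-attempts=1 leaves the two remaining containers charged to the queue forever, mirroring step 4 above.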

      The root cause is an inconsistency in how app attempt failure is handled between RMAppAttemptImpl$BaseFinalTransition#transition and RMAppImpl$AttemptFailedTransition#transition:

      1. When the attempt fails, RMAppAttemptImpl$BaseFinalTransition#transition checks the AM container's exit status. If it is PREEMPTED, ABORTED, DISKS_FAILED, or KILLED_BY_RESOURCEMANAGER, the failure does not count towards the max attempt retry, so the transition sends an AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
      2. RMAppImpl$AttemptFailedTransition#transition handles the RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
      3. CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in CapacityScheduler#doneApplicationAttempt. Because keepContainersAcrossAppAttempts is true, it skips killing and completing the containers that belong to this app, so the queue resource leak occurs.
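      One plausible direction for closing the leak is to honor the keep-containers flag only when a next attempt will actually exist to adopt the containers. The sketch below illustrates that decision in isolation (method and parameter names are illustrative assumptions, not the actual CapacityScheduler code or the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

public class DoneAttemptSketch {
    /**
     * Cleanup logic for an attempt-removed event. Containers are kept only
     * when the app will retry; if the app has terminally failed, they are
     * completed so the queue's used resource is decremented. Returns the
     * total MB released back to the queue.
     */
    static int cleanupOnAttemptRemoved(boolean keepContainersRequested,
                                       boolean appWillRetry,
                                       List<Integer> liveContainersMB) {
        int releasedMB = 0;
        boolean keep = keepContainersRequested && appWillRetry;
        if (!keep) {
            for (int mb : liveContainersMB) releasedMB += mb; // complete containers
            liveContainersMB.clear();
        }
        return releasedMB;
    }

    public static void main(String[] args) {
        List<Integer> live = new ArrayList<>(List.of(2048, 2048));
        // App terminally failed (no retry): release despite the PREEMPTED exit,
        // so no resource stays charged to the queue.
        System.out.println("released MB: "
                + cleanupOnAttemptRemoved(true, false, live));
    }
}
```

      The key design point is that "does not count towards max attempt retry" and "another attempt will run" are separate questions; conflating them is exactly the inconsistency described above.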

        Attachments

        1. image-2019-03-29-10-47-47-953.png
          85 kB
          Tao Yang
        2. YARN-9413.001.patch
          6 kB
          Tao Yang
        3. YARN-9413.002.patch
          6 kB
          Tao Yang
        4. YARN-9413.003.patch
          12 kB
          Tao Yang
        5. YARN-9413.branch-3.0.001.patch
          10 kB
          Tao Yang


            People

            • Assignee:
              Tao Yang
              Reporter:
              Tao Yang
            • Votes:
              0
              Watchers:
              6
