Hadoop YARN / YARN-9413

Queue resource leak after app failure in CapacityScheduler

Details

    • Reviewed

    Description

      To reproduce this problem:

      1. Submit an app that is configured to keep containers across app attempts and that should fail as soon as its first AM attempt finishes (am-max-attempts=1).
      2. The app starts with 2 containers running on node NM1.
      3. Fail the AM of the application with the PREEMPTED exit status. This status should not count towards the max attempt retry limit, but the app fails immediately anyway.
      4. The used resource of the queue leaks after the app fails.
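
      The leak in step 4 can be pictured with a toy accounting model. This is only a sketch and not the CapacityScheduler code: the class QueueUsageSketch, its methods, the container names and the 1024 MB sizes are all hypothetical. It assumes a queue that adds to its used resource on allocation and subtracts only when a container goes through the completion path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of queue usage accounting (not real YARN classes).
final class QueueUsageSketch {
    private int usedMemoryMb = 0;
    private final Map<String, Integer> liveContainers = new HashMap<>();

    // Allocation charges the container's memory to the queue.
    void allocate(String containerId, int memoryMb) {
        liveContainers.put(containerId, memoryMb);
        usedMemoryMb += memoryMb;
    }

    // Completion is the only path that gives the memory back to the queue.
    void completeContainer(String containerId) {
        Integer memoryMb = liveContainers.remove(containerId);
        if (memoryMb != null) {
            usedMemoryMb -= memoryMb;
        }
    }

    // Removing an attempt either completes its containers or skips them.
    void removeAttempt(boolean skipCleanup) {
        if (skipCleanup) {
            return; // containers are left charged to the queue
        }
        for (String id : new ArrayList<>(liveContainers.keySet())) {
            completeContainer(id);
        }
    }

    int usedMemoryMb() {
        return usedMemoryMb;
    }

    // Runs the scenario: 2 containers, then the attempt is removed.
    static int usedAfterFailure(boolean skipCleanup) {
        QueueUsageSketch queue = new QueueUsageSketch();
        queue.allocate("container_1", 1024); // assumed size
        queue.allocate("container_2", 1024); // assumed size
        queue.removeAttempt(skipCleanup);
        return queue.usedMemoryMb();
    }

    public static void main(String[] args) {
        System.out.println("used with cleanup:    " + usedAfterFailure(false));
        System.out.println("used without cleanup: " + usedAfterFailure(true));
    }
}
```

      In this model, skipping the cleanup leaves both containers charged to the queue even though no app remains to release them.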

      The root cause is that app attempt failure is handled inconsistently between RMAppAttemptImpl$BaseFinalTransition#transition and RMAppImpl$AttemptFailedTransition#transition:

      1. After the app attempt fails, RMAppAttemptImpl$BaseFinalTransition#transition sends an RMAppFailedAttemptEvent. If the exit status of the AM container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, the failure does not count towards the max attempt retry limit, so the transition sends AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
      2. RMAppImpl$AttemptFailedTransition#transition handles the RMAppFailedAttemptEvent and fails the app if its max app attempts is 1.
      3. CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in CapacityScheduler#doneApplicationAttempt. Because keepContainersAcrossAppAttempts=true, it skips killing the containers that belong to this app and does not run the completion process for them, so the queue resource leak happens.
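
      The mismatch between the two decisions can be sketched as a small truth check. This is a simplified model of the description above, not the ResourceManager code: the class AttemptFailureSketch, its method names and the OTHER_FAILURE value are hypothetical, and the real transitions consider more state than is shown here.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the two independent decisions described above.
final class AttemptFailureSketch {
    // OTHER_FAILURE is a placeholder for any exit status that counts as a failure.
    enum ExitStatus { PREEMPTED, ABORTED, DISKS_FAILED, KILLED_BY_RESOURCEMANAGER, OTHER_FAILURE }

    // Exit statuses that do not count towards the max attempt retry limit.
    static final Set<ExitStatus> NOT_COUNTED = EnumSet.of(
            ExitStatus.PREEMPTED,
            ExitStatus.ABORTED,
            ExitStatus.DISKS_FAILED,
            ExitStatus.KILLED_BY_RESOURCEMANAGER);

    // Attempt side (step 1): keep containers when the failure is not counted.
    static boolean attemptKeepsContainers(boolean keepAcrossAttempts, ExitStatus status) {
        return keepAcrossAttempts && NOT_COUNTED.contains(status);
    }

    // App side (step 2): the app fails when max app attempts is 1.
    static boolean appFails(int maxAppAttempts) {
        return maxAppAttempts == 1;
    }

    // Scheduler side (step 3): containers are kept although no later attempt exists.
    static boolean leaks(boolean keepAcrossAttempts, ExitStatus status, int maxAppAttempts) {
        return attemptKeepsContainers(keepAcrossAttempts, status) && appFails(maxAppAttempts);
    }

    public static void main(String[] args) {
        System.out.println("PREEMPTED, max attempts 1:     leak = "
                + leaks(true, ExitStatus.PREEMPTED, 1));
        System.out.println("OTHER_FAILURE, max attempts 1: leak = "
                + leaks(true, ExitStatus.OTHER_FAILURE, 1));
    }
}
```

      The leak condition in this model is the case where the attempt side decides to keep containers while the app side decides to fail the app, so nothing ever releases them.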

      Attachments

        1. YARN-9413.branch-3.0.001.patch
          10 kB
          Tao Yang
        2. YARN-9413.003.patch
          12 kB
          Tao Yang
        3. YARN-9413.002.patch
          6 kB
          Tao Yang
        4. YARN-9413.001.patch
          6 kB
          Tao Yang
        5. image-2019-03-29-10-47-47-953.png
          85 kB
          Tao Yang

          People

            Tao Yang