  1. Hadoop YARN
  2. YARN-10895

ContainerIdPBImpl objects still can be leaked in RMNodeImpl.completedContainers


Details

    Description

      YARN-10467 fixed a ContainerIdPBImpl object leak in RMNodeImpl.completedContainers.

      After applying the YARN-10467 patch and operating a cluster with a large number of nodes, we found that a similar heap leak still exists.

      In a heap dump taken after failover (so the RM was no longer active), about 4.5 GB is held by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.

       

      There are two cases.

       

      1. Apps with 'KeepContainersAcrossApplicationAttempts' set are not cleaned up when they fail

      Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear RMAppAttemptImpl.justFinishedContainers.

      If an app attempt fails and is retried by a next attempt, we may not need to clear RMAppAttemptImpl.justFinishedContainers, because the related ContainerIdPBImpl objects are handed over to the next attempt and eventually cleared.

      However, when the app itself fails, there is no next attempt and the heap leak occurs.

      (We found this case when a YARN Service application failed across multiple attempts because of OOM in the AM.)
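      The first case can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions, not the RM code: the class KeepContainersLeakSketch, its method, and the container ID strings are hypothetical, a single node is assumed, and plain collections stand in for RMNodeImpl.completedContainers and RMAppAttemptImpl.justFinishedContainers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of case 1; none of these names exist in Hadoop.
class KeepContainersLeakSketch {

    // Returns how many container IDs the node still retains after the
    // final attempt of an app has failed.
    static int retainedAfterFinalFailure(int attempts, boolean clearOnFinalFailure) {
        // Stand-in for RMNodeImpl.completedContainers on one node.
        Set<String> completedContainers = new HashSet<>();
        // Stand-in for RMAppAttemptImpl.justFinishedContainers.
        List<String> justFinishedContainers = new ArrayList<>();

        for (int i = 1; i <= attempts; i++) {
            // One container finishes during each attempt: the node remembers
            // it and the attempt queues it.
            String containerId = "container_" + i;
            completedContainers.add(containerId);
            justFinishedContainers.add(containerId);

            boolean lastAttempt = (i == attempts);
            if (lastAttempt && clearOnFinalFailure) {
                // Assumed behavior of the proposed change: on final failure
                // the queued IDs are sent to the node, which drops them.
                completedContainers.removeAll(justFinishedContainers);
                justFinishedContainers.clear();
            }
            // Otherwise the queued IDs are carried over to the next attempt
            // (keep-containers behavior) and nothing is released yet.
        }
        return completedContainers.size();
    }
}
```

      In this model, retainedAfterFinalFailure(3, false) leaves all 3 IDs on the node, while retainedAfterFinalFailure(3, true) leaves none.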

       

      2. Apps killed explicitly by the user

      When an app is killed by the user through the 'yarn application -kill' CLI or the web UI, RMAppAttemptImpl.amContainerFinished is not called because the app and app attempt states have already changed.

       

      To handle this, we added sendFinishedContainersToNMs calls for both RMAppAttemptImpl.finishedContainersSentToAm and RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
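      The idea of flushing both collections on the 'KILLED' transition can be sketched as follows. This is a simplified stand-alone model and not the attached patch: the class KilledAttemptSketch and its methods are hypothetical, node and container IDs are plain strings, and the maps only borrow the field names mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of case 2; it does not use any Hadoop classes.
class KilledAttemptSketch {
    // Stand-ins for the two RMAppAttemptImpl maps: node -> container IDs.
    final Map<String, List<String>> finishedContainersSentToAm = new HashMap<>();
    final Map<String, List<String>> justFinishedContainers = new HashMap<>();
    // Stand-in for RMNodeImpl.completedContainers, keyed by node.
    final Map<String, Set<String>> completedContainersByNode = new HashMap<>();

    // A container finishes: the node remembers it and the attempt queues it
    // in one of the two maps, depending on whether the AM has seen it.
    void recordFinished(String node, String containerId, boolean alreadySentToAm) {
        completedContainersByNode
            .computeIfAbsent(node, k -> new HashSet<>()).add(containerId);
        Map<String, List<String>> target =
            alreadySentToAm ? finishedContainersSentToAm : justFinishedContainers;
        target.computeIfAbsent(node, k -> new ArrayList<>()).add(containerId);
    }

    // Models the effect assumed for sendFinishedContainersToNMs: each queued
    // ID is acknowledged to its node, then the queue is emptied.
    private void sendFinishedContainersToNMs(Map<String, List<String>> queued) {
        for (Map.Entry<String, List<String>> e : queued.entrySet()) {
            Set<String> onNode = completedContainersByNode.get(e.getKey());
            if (onNode != null) {
                onNode.removeAll(e.getValue());
            }
        }
        queued.clear();
    }

    // Called when the attempt is set to KILLED: flush both maps.
    void onKilled() {
        sendFinishedContainersToNMs(finishedContainersSentToAm);
        sendFinishedContainersToNMs(justFinishedContainers);
    }

    // Total number of container IDs still retained across all nodes.
    int retainedIds() {
        return completedContainersByNode.values().stream()
            .mapToInt(Set::size).sum();
    }
}
```

      Flushing only one of the two maps would still leave the IDs from the other map on their nodes, which is why the sketch handles both.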

       

      We found this on our 3.1.2 cluster and patched it there, but trunk appears to have the same problem.

      I have attached a patch based on trunk.

       

      Thanks!

      Attachments

        1. YARN-10895.001.patch
          4 kB
          Jeongin Ju

            People

              Assignee: Unassigned
              Reporter: Jeongin Ju (acedia28)
              Votes: 0
              Watchers: 3


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m