YARN-5296: NMs going OutOfMemory because ContainerMetrics leak in ContainerMonitorImpl

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: nodemanager
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Ran tests in the following manner:
      1. Run GridMix of 768 sequentially around 17 times to execute about 12.9K apps.
      2. After 4-5 hrs, check the NM heap using Memory Analyzer. It reports that around 96% of the heap is being used by ContainerMetrics.
      3. Run 7 more GridMix runs, for around 18.2K apps run in total. Check the NM heap again using Memory Analyzer: again 96% of the heap is being used by ContainerMetrics.
      4. Start one more GridMix run. While the run was going on, NMs started going down with OOM at around 18.7K+ apps run. On analysing the NM heap using Memory Analyzer, the OOM was caused by ContainerMetrics.

      1. YARN-5296-v2.1.patch
        3 kB
        Junping Du
      2. YARN-5296-v2.patch
        3 kB
        Junping Du
      3. after v2 fix.png
        435 kB
        Junping Du
      4. before v2 fix.png
        455 kB
        Junping Du
      5. YARN-5296.patch
        2 kB
        Junping Du

          Activity

          karams Karam Singh added a comment -

          From offline discussion with Junping Du

          Root cause is in YARN-4811, where we launch tasks (scheduleAtFixedRate) in MutableQuantiles but never get a chance to terminate these tasks.
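
The root cause described in this comment can be illustrated with a minimal, self-contained sketch. It is not the Hadoop code: `LeakSketch`, `Quantiles`, and `createQuantiles` are hypothetical names, and only the general pattern (a periodic task started with scheduleAtFixedRate on a shared static scheduler and never cancelled) is taken from the comment. Each queued task holds a reference to the object that scheduled it, so those objects stay reachable for as long as the task stays in the scheduler's queue.

```java
import java.util.Arrays;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class LeakSketch {
    // One scheduler shared by every instance (assumption: a static field,
    // with a daemon thread so the sketch does not block JVM exit).
    static final ScheduledThreadPoolExecutor SCHEDULER =
        new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "rollover");
            t.setDaemon(true);
            return t;
        });

    /** Hypothetical stand-in for a per-container quantiles object. */
    static class Quantiles {
        final long[] samples = new long[1024];
        final ScheduledFuture<?> rollover;

        Quantiles(long periodSeconds) {
            // The method reference captures 'this': the scheduler's queue now
            // references this object until the task is cancelled and removed.
            rollover = SCHEDULER.scheduleAtFixedRate(
                this::rollOver, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        }

        void rollOver() {
            Arrays.fill(samples, 0L);
        }
    }

    /** Creates n objects and drops every local reference to them. */
    static void createQuantiles(int n) {
        for (int i = 0; i < n; i++) {
            new Quantiles(3600);
        }
    }

    public static void main(String[] args) {
        createQuantiles(1000);
        // The objects are unreachable from this code, yet their tasks
        // (and therefore the objects) are still held by the scheduler.
        System.out.println("queued tasks: " + SCHEDULER.getQueue().size());
    }
}
```

In this sketch nothing ever cancels `rollover`, so the queue grows by one entry per object created, which mirrors the unbounded growth reported in the description.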

          rajesh.balamohan Rajesh Balamohan added a comment -

          Based on an offline conversation with Karam Singh, I have changed the assignee to Junping Du.
          \cc Junping Du

          djp Junping Du added a comment -

          Thanks Karam Singh for reporting the issue. Uploaded a patch that fixes it by terminating the scheduled tasks (RolloverSample) after the container has finished.

          djp Junping Du added a comment -

          CC Jian He

          jianhe Jian He added a comment -

          lgtm

          templedf Daniel Templeton added a comment -

          Thanks for the patch, Junping Du. Looks to me like that patch will stop the memory usage from growing, but it won't free the memory already consumed. It appears the underlying issue is that the ContainerMetricsQuantiles.histogram is never actually cleared.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 36s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          0 mvndep 0m 12s Maven dependency ordering for branch
          +1 mvninstall 7m 42s trunk passed
          +1 compile 8m 4s trunk passed
          +1 checkstyle 1m 22s trunk passed
          +1 mvnsite 1m 21s trunk passed
          +1 mvneclipse 0m 26s trunk passed
          +1 findbugs 2m 0s trunk passed
          +1 javadoc 1m 4s trunk passed
          0 mvndep 0m 12s Maven dependency ordering for patch
          +1 mvninstall 1m 3s the patch passed
          +1 compile 6m 41s the patch passed
          +1 javac 6m 41s the patch passed
          +1 checkstyle 1m 21s the patch passed
          +1 mvnsite 1m 27s the patch passed
          +1 mvneclipse 0m 26s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 20s the patch passed
          +1 javadoc 1m 3s the patch passed
          +1 unit 7m 10s hadoop-common in the patch passed.
          -1 unit 10m 14s hadoop-yarn-server-nodemanager in the patch failed.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          56m 2s



          Reason Tests
          Failed junit tests hadoop.yarn.server.nodemanager.containermanager.TestContainerManager
            hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRegression
            hadoop.yarn.server.nodemanager.TestNodeManagerResync
            hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch
            hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager
            hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainerMetrics
            hadoop.yarn.server.nodemanager.containermanager.TestContainerManagerRecovery
            hadoop.yarn.server.nodemanager.containermanager.container.TestContainer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12813861/YARN-5296.patch
          JIRA Issue YARN-5296
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 8d4d513e698e 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9683eab
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12135/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12135/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12135/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: .
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12135/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          djp Junping Du added a comment -

          Thanks Daniel Templeton for the review and comments! I think ContainerMetricsQuantiles (including the histogram) will be GCed when the container is finished, won't it? The memory leak happens here because the launched tasks never get a chance to shut down, so they retain all container metrics, which are never released.

          templedf Daniel Templeton added a comment -

          Looks like it will only be GCed when the container is GCed. Does the NM release its containers promptly after completion?

          djp Junping Du added a comment -

          Looks like it will only be GCed when the container is GCed. Does the NM release its containers promptly after completion?

          I think so. If not, it could be another problem that we should address separately.

          templedf Daniel Templeton added a comment -

          OK. Have you tested to see that this patch reduces the amount of heap consumed by metrics per the original problem statement?

          djp Junping Du added a comment -

          Not yet. But it is pretty clear from the heap dump we got that we need to fix things here, as the attached fix does.

          jianhe Jian He added a comment -

          I analyzed the memory heap dump together with Junping. This is indeed an issue to be fixed.
          I'd like to commit this today.

          templedf Daniel Templeton added a comment -

          If the fix resolves the issue, then LGTM.

          jianhe Jian He added a comment -

          Yep, we'll run more testing to verify that the OOM is resolved.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          ContainerMetrics went into 2.7.0 though the histograms only went into 2.9.0. Does this patch need to go into 2.7.3?

          jianhe Jian He added a comment -

          No, this does not need to be in 2.7.3

          djp Junping Du added a comment -

          I just did some tests today. Actually, the first patch introduces another issue: scheduler.shutdown() affects container metrics that arrive later (an exception gets thrown), because the scheduler is static and shared by all objects. The v2 patch instead cancels the individual task when the container finishes, which indeed fixes the previous OOM issue according to jmap dump analysis (screenshots attached).
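
The difference between the two approaches in this comment can be sketched with the standard java.util.concurrent API. This is an illustration only, not the contents of either patch: `CancelVersusShutdown` and its methods are hypothetical names, and the call to setRemoveOnCancelPolicy(true) is a choice made here so that a cancelled task leaves the queue immediately rather than at its next trigger time; the thread does not say whether the real code does this.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CancelVersusShutdown {
    static ScheduledThreadPoolExecutor newScheduler() {
        ScheduledThreadPoolExecutor s = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "rollover");
            t.setDaemon(true);
            return t;
        });
        // Assumption of this sketch: remove cancelled tasks from the queue at once.
        s.setRemoveOnCancelPolicy(true);
        return s;
    }

    /** Shutting down a shared scheduler: later scheduling attempts are rejected. */
    static boolean rejectedAfterShutdown() {
        ScheduledThreadPoolExecutor shared = newScheduler();
        shared.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
        shared.shutdown(); // what a "finished" object must not do to a shared scheduler
        try {
            shared.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
            return false;
        } catch (RejectedExecutionException e) {
            return true; // the next object to register hits this exception
        }
    }

    /** Cancelling one task: only that task goes away, the other keeps running. */
    static int queuedAfterCancellingOne() {
        ScheduledThreadPoolExecutor shared = newScheduler();
        ScheduledFuture<?> first =
            shared.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
        shared.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
        first.cancel(false); // do not interrupt a run that is in progress
        int remaining = shared.getQueue().size();
        shared.shutdownNow(); // clean up the sketch's own scheduler
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + rejectedAfterShutdown());
        System.out.println("queued after cancelling one: " + queuedAfterCancellingOne());
    }
}
```

Cancelling the per-object ScheduledFuture keeps the scheduler usable for every other object, which is why it suits a scheduler held in a static field.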

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          0 mvndep 0m 27s Maven dependency ordering for branch
          +1 mvninstall 6m 18s trunk passed
          +1 compile 6m 33s trunk passed
          +1 checkstyle 1m 19s trunk passed
          +1 mvnsite 1m 20s trunk passed
          +1 mvneclipse 0m 27s trunk passed
          +1 findbugs 1m 59s trunk passed
          +1 javadoc 1m 3s trunk passed
          0 mvndep 0m 11s Maven dependency ordering for patch
          +1 mvninstall 0m 59s the patch passed
          +1 compile 6m 32s the patch passed
          +1 javac 6m 32s the patch passed
          -1 checkstyle 1m 21s root: The patch generated 1 new + 30 unchanged - 0 fixed = 31 total (was 30)
          +1 mvnsite 1m 20s the patch passed
          +1 mvneclipse 0m 26s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 15s the patch passed
          +1 javadoc 1m 2s the patch passed
          +1 unit 7m 1s hadoop-common in the patch passed.
          +1 unit 13m 16s hadoop-yarn-server-nodemanager in the patch passed.
          +1 asflicense 0m 23s The patch does not generate ASF License warnings.
          55m 23s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12815974/YARN-5296-v2.patch
          JIRA Issue YARN-5296
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 871b0ab1a3b8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8b281bc
          Default Java 1.8.0_91
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12176/artifact/patchprocess/diff-checkstyle-root.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12176/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: .
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12176/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          djp Junping Du added a comment -

          Fix checkstyle issue in v2.1 patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 26s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          0 mvndep 0m 11s Maven dependency ordering for branch
          +1 mvninstall 8m 10s trunk passed
          +1 compile 6m 56s trunk passed
          +1 checkstyle 1m 26s trunk passed
          +1 mvnsite 1m 30s trunk passed
          +1 mvneclipse 0m 30s trunk passed
          +1 findbugs 2m 1s trunk passed
          +1 javadoc 1m 3s trunk passed
          0 mvndep 0m 12s Maven dependency ordering for patch
          +1 mvninstall 1m 7s the patch passed
          +1 compile 6m 44s the patch passed
          +1 javac 6m 44s the patch passed
          +1 checkstyle 1m 20s the patch passed
          +1 mvnsite 1m 20s the patch passed
          +1 mvneclipse 0m 26s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 14s the patch passed
          +1 javadoc 1m 3s the patch passed
          +1 unit 7m 52s hadoop-common in the patch passed.
          +1 unit 13m 31s hadoop-yarn-server-nodemanager in the patch passed.
          +1 asflicense 0m 23s The patch does not generate ASF License warnings.
          59m 19s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12816071/YARN-5296-v2.1.patch
          JIRA Issue YARN-5296
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 63a66e053908 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8b4b525
          Default Java 1.8.0_91
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12181/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: .
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12181/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          jianhe Jian He added a comment -

          lgtm, thanks for catching this !

          jianhe Jian He added a comment -

          Committed to trunk, branch-2, thanks Junping !
          Thanks Daniel Templeton for the review !

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10053 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10053/)
          YARN-5296. NMs going OutOfMemory because ContainerMetrics leak in (jianhe: rev d792a90206e940c31d1048e53dc24ded605788bf)

          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/lib/MutableQuantiles.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainerMetrics.java
          jlowe Jason Lowe added a comment -

          I just ran across a ContainerMetrics leak that I thought could be the same issue. However this one occurred on a 2.7-based release and has to do with the names piling up in UniqueNames. See YARN-5341.

          djp Junping Du added a comment -

          Hi Jason Lowe, I think the problem you mentioned above is already fixed in YARN-5190. We may just need a backport to 2.7.3.

          djp Junping Du added a comment -

          Forgot to mention: another issue that could cause an NM container metrics leak is that we didn't backport YARN-1643, which includes a very trivial (but important) fix: calling metrics.finished() when the container is finished. I think we may need a separate JIRA to fix it for 2.7 only.
          In short, with YARN-5190 + minimal YARN-1643, the OOM issue on the NM for 2.7.3 should go away.

          leftnoteasy Wangda Tan added a comment - - edited

          Junping Du/Jian He,
          I'm trying to understand why minimal of YARN-1643 is required for branch-2.7:

          When STOP_MONITORING_CONTAINER is handled, the container is added to containersToBeRemoved, and in the running thread the following is called for every container in containersToBeRemoved:

          ContainerMetrics.forContainer(
                            containerId, containerMetricsPeriodMs,
                            containerMetricsUnregisterDelayMs).finished();
          

          It seems to me there's no issue here; please comment if you think differently.

          Thanks,
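
The flow described in this comment (an event only records the container, and the monitoring thread later calls finished() on its metrics) can be sketched as follows. All names here, including `DeferredRemovalSketch`, `Metrics`, `onStopMonitoring`, and `drainOnce`, are hypothetical stand-ins and not the NodeManager classes; only `containersToBeRemoved` and the `forContainer(...).finished()` call shape come from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class DeferredRemovalSketch {
    /** Hypothetical stand-in for per-container metrics. */
    static class Metrics {
        volatile boolean done;

        void finished() {
            done = true; // a real implementation would release resources here
        }
    }

    private final ConcurrentHashMap<String, Metrics> metricsById =
        new ConcurrentHashMap<>();
    private final Set<String> containersToBeRemoved = ConcurrentHashMap.newKeySet();

    /** Returns the metrics object for a container, creating it on first use. */
    Metrics forContainer(String containerId) {
        return metricsById.computeIfAbsent(containerId, id -> new Metrics());
    }

    /** Event-handler side: only records that the container should be removed. */
    void onStopMonitoring(String containerId) {
        containersToBeRemoved.add(containerId);
    }

    /** Monitoring-thread side: one pass that finishes every recorded container. */
    List<String> drainOnce() {
        List<String> finishedIds = new ArrayList<>();
        for (String id : containersToBeRemoved) {
            forContainer(id).finished();
            containersToBeRemoved.remove(id); // safe: the key set is weakly consistent
            finishedIds.add(id);
        }
        return finishedIds;
    }
}
```

With this shape, finished() runs exactly once per stopped container as long as the monitoring pass keeps running, which is the property the comment relies on.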

          djp Junping Du added a comment - - edited

          Wangda Tan, this is actually not needed for branch-2.7. As I discussed with Jason on HADOOP-13362, it was just a misunderstanding caused by containers being removed in different places in branch-2.7 and branch-2. Just forget about my comment above.
          Also, I noticed you reopened YARN-5190 for branch-2.7, which seems to duplicate HADOOP-13362. Can you double check and close it? Thx!

          leftnoteasy Wangda Tan added a comment -

          Already closed it, thanks for explanations!


            People

            • Assignee:
              djp Junping Du
              Reporter:
              karams Karam Singh
            • Votes:
              0
              Watchers:
              21
