Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-13362

DefaultMetricsSystem leaks the source name when a source unregisters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.7.4
    • Component/s: metrics
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Ran across a nodemanager that was spending most of its time in GC. Upon examination of the heap most of the memory was going to the map of names in org.apache.hadoop.metrics2.lib.UniqueNames. In this case the map had almost 2 million entries. Looking at a few of the map showed entries like "ContainerResource_container_e01_1459548490386_8560138_01_002020", "ContainerResource_container_e01_1459548490386_2378745_01_000410", etc.

      Looks like the ContainerMetrics for each container will cause a unique name to be registered with UniqueNames and the name will never be unregistered.

        Issue Links

          Activity

          Hide
          jlowe Jason Lowe added a comment -

          Looks like this may be a bug in the metrics system rather than container metrics. At a glance it appears that unregistering a metrics source won't remove the name from UniqueNames which would be a bug in HADOOP rather than YARN if that ends up being the case.

          Show
          jlowe Jason Lowe added a comment - Looks like this may be a bug in the metrics system rather than container metrics. At a glance it appears that unregistering a metrics source won't remove the name from UniqueNames which would be a bug in HADOOP rather than YARN if that ends up being the case.
          Hide
          jlowe Jason Lowe added a comment -

          This looks like a problem in DefaultMetricsSystem. There are two UniqueNames instances in that class, mBeanNames and sourceNames. When a ContainerMetrics source unregisters the names are getting cleaned out properly from mBeanNames but not from sourceNames. Each source will register a new name in sourceNames, but the only way to clear them out is by shutting down the DefaultMetricsSystem instance. There needs to be a way to remove elements from sourceNames when a source unregisters.

          I updated the summary based on this analysis, and I'll move the JIRA to the HADOOP project since that's where the fix is needed.

          Show
          jlowe Jason Lowe added a comment - This looks like a problem in DefaultMetricsSystem. There are two UniqueNames instances in that class, mBeanNames and sourceNames. When a ContainerMetrics source unregisters the names are getting cleaned out properly from mBeanNames but not from sourceNames. Each source will register a new name in sourceNames, but the only way to clear them out is by shutting down the DefaultMetricsSystem instance. There needs to be a way to remove elements from sourceNames when a source unregisters. I updated the summary based on this analysis, and I'll move the JIRA to the HADOOP project since that's where the fix is needed.
          Hide
          djp Junping Du added a comment -

          Hi Jason Lowe, as I mentioned in YARN-5296, I think the work you proposed above is already done in YARN-5190. Can you double check on it? Thanks!

          Show
          djp Junping Du added a comment - Hi Jason Lowe , as I mentioned in YARN-5296 , I think the work you proposed above is already done in YARN-5190 . Can you double check on it? Thanks!
          Hide
          jlowe Jason Lowe added a comment -

          Thanks for the pointer! Yes, it appears YARN-5190 fixed this as a side-effect since it added the requisite methods to DefaultMetricsSystem.

          You had mentioned in YARN-5296 that a part of YARN-1643 is needed since metrics.finished() isn't being called, but I'm not sure that's necessary. In the heap dump I didn't see any leak beyond the source names. MetricsSystemImpl.unregisterSource is being called for each container, and once YARN-5190 changes that method to unregister the source name from DefaultMetrics I think we're OK at that point.

          Show
          jlowe Jason Lowe added a comment - Thanks for the pointer! Yes, it appears YARN-5190 fixed this as a side-effect since it added the requisite methods to DefaultMetricsSystem. You had mentioned in YARN-5296 that a part of YARN-1643 is needed since metrics.finished() isn't being called, but I'm not sure that's necessary. In the heap dump I didn't see any leak beyond the source names. MetricsSystemImpl.unregisterSource is being called for each container, and once YARN-5190 changes that method to unregister the source name from DefaultMetrics I think we're OK at that point.
          Hide
          jlowe Jason Lowe added a comment -

          Ah I think I see what you mean. I think if we just took the source name unregistering from YARN-5190 we'd be OK, but if we take the whole patch then it also adds a metrics leak that we need parts of YARN-1643 to fix?

          Show
          jlowe Jason Lowe added a comment - Ah I think I see what you mean. I think if we just took the source name unregistering from YARN-5190 we'd be OK, but if we take the whole patch then it also adds a metrics leak that we need parts of YARN-1643 to fix?
          Hide
          jlowe Jason Lowe added a comment -

          Note that we've not seen the uncaught exception issue described in YARN-5190 on 2.7, probably because 2.7 doesn't have YARN-4906. Seems to me the prudent thing to do is extract the part of YARN-5190 that fixes the issue with sourceNames leaking since it's a small and targeted fix that we'd have high confidence doesn't create new problems. If you agree we can reopen this issue and use it to target a patch to 2.7.

          Show
          jlowe Jason Lowe added a comment - Note that we've not seen the uncaught exception issue described in YARN-5190 on 2.7, probably because 2.7 doesn't have YARN-4906 . Seems to me the prudent thing to do is extract the part of YARN-5190 that fixes the issue with sourceNames leaking since it's a small and targeted fix that we'd have high confidence doesn't create new problems. If you agree we can reopen this issue and use it to target a patch to 2.7.
          Hide
          djp Junping Du added a comment -

          Note that we've not seen the uncaught exception issue described in YARN-5190 on 2.7, probably because 2.7 doesn't have YARN-4906.

          Agree. After YARN-4906, things a bit tricky here is: we are calling ContainerMetrics.forContainer() (and unregister) two times: once in ContainerImpl and the other one in ContainerMonitorImpl. And the fix in YARN-1643 has issue in this case because it will call register a metrics again before calling finish it.
          However, I am still suspecting only backport part of YARN-5190 is enough as I didn't see where we call ContainerMetrics.finish() in 2.7.3. Do I miss anything here?

          Show
          djp Junping Du added a comment - Note that we've not seen the uncaught exception issue described in YARN-5190 on 2.7, probably because 2.7 doesn't have YARN-4906 . Agree. After YARN-4906 , things a bit tricky here is: we are calling ContainerMetrics.forContainer() (and unregister) two times: once in ContainerImpl and the other one in ContainerMonitorImpl. And the fix in YARN-1643 has issue in this case because it will call register a metrics again before calling finish it. However, I am still suspecting only backport part of YARN-5190 is enough as I didn't see where we call ContainerMetrics.finish() in 2.7.3. Do I miss anything here?
          Hide
          djp Junping Du added a comment -

          Ah. I think it is get called in ContainerDoneTransition.
          Sure. Please feel free to reopen this issue (or YARN-5190) and I can backport the fix in YARN-5190.

          Show
          djp Junping Du added a comment - Ah. I think it is get called in ContainerDoneTransition. Sure. Please feel free to reopen this issue (or YARN-5190 ) and I can backport the fix in YARN-5190 .
          Hide
          jlowe Jason Lowe added a comment -

          However, I am still suspecting only backport part of YARN-5190 is enough as I didn't see where we call ContainerMetrics.finish() in 2.7.3. Do I miss anything here?

          It's definitely getting called in practice because after almost 2 million containers on that node the only leak I saw was the metric source names. The bean names and other stuff was getting cleaned up as it should. The metrics are getting cleaned up in the monitoring thread's run() method:

                  // Remove finished containers
                  synchronized (containersToBeRemoved) {
                    for (ContainerId containerId : containersToBeRemoved) {
                      if (containerMetricsEnabled) {
                        ContainerMetrics.forContainer(
                            containerId, containerMetricsPeriodMs,
                            containerMetricsUnregisterDelayMs).finished();
          
          Show
          jlowe Jason Lowe added a comment - However, I am still suspecting only backport part of YARN-5190 is enough as I didn't see where we call ContainerMetrics.finish() in 2.7.3. Do I miss anything here? It's definitely getting called in practice because after almost 2 million containers on that node the only leak I saw was the metric source names. The bean names and other stuff was getting cleaned up as it should. The metrics are getting cleaned up in the monitoring thread's run() method: // Remove finished containers synchronized (containersToBeRemoved) { for (ContainerId containerId : containersToBeRemoved) { if (containerMetricsEnabled) { ContainerMetrics.forContainer( containerId, containerMetricsPeriodMs, containerMetricsUnregisterDelayMs).finished();
          Hide
          jlowe Jason Lowe added a comment -

          Reopening to target a fix just to the DefaultMetricsSystem for 2.7 rather than pulling in the entire patch from YARN-5296 (and its dependencies).

          Show
          jlowe Jason Lowe added a comment - Reopening to target a fix just to the DefaultMetricsSystem for 2.7 rather than pulling in the entire patch from YARN-5296 (and its dependencies).
          Hide
          djp Junping Du added a comment -

          The metrics are getting cleaned up in the monitoring thread's run() method.

          I see. The monitor thread code of run() is significantly different with trunk now so I missed that before.

          Put a patch to include hadoop-common fix only (with test in TestContainerMetrics).

          Show
          djp Junping Du added a comment - The metrics are getting cleaned up in the monitoring thread's run() method. I see. The monitor thread code of run() is significantly different with trunk now so I missed that before. Put a patch to include hadoop-common fix only (with test in TestContainerMetrics).
          Hide
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 11m 3s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 0m 22s Maven dependency ordering for branch
          +1 mvninstall 14m 15s branch-2.7 passed
          +1 compile 5m 48s branch-2.7 passed with JDK v1.8.0_91
          +1 compile 6m 30s branch-2.7 passed with JDK v1.7.0_101
          +1 checkstyle 1m 21s branch-2.7 passed
          +1 mvnsite 1m 18s branch-2.7 passed
          +1 mvneclipse 0m 29s branch-2.7 passed
          -1 findbugs 1m 41s hadoop-common-project/hadoop-common in branch-2.7 has 3 extant Findbugs warnings.
          +1 javadoc 1m 7s branch-2.7 passed with JDK v1.8.0_91
          +1 javadoc 1m 20s branch-2.7 passed with JDK v1.7.0_101
          0 mvndep 0m 13s Maven dependency ordering for patch
          +1 mvninstall 1m 1s the patch passed
          +1 compile 7m 1s the patch passed with JDK v1.8.0_91
          +1 javac 7m 1s the patch passed
          +1 compile 6m 13s the patch passed with JDK v1.7.0_101
          +1 javac 6m 13s the patch passed
          +1 checkstyle 1m 17s the patch passed
          +1 mvnsite 1m 17s the patch passed
          +1 mvneclipse 0m 25s the patch passed
          -1 whitespace 0m 1s The patch has 2857 line(s) that end in whitespace. Use git apply --whitespace=fix.
          -1 whitespace 1m 33s The patch 124 line(s) with tabs.
          +1 findbugs 2m 49s the patch passed
          +1 javadoc 1m 15s the patch passed with JDK v1.8.0_91
          +1 javadoc 1m 22s the patch passed with JDK v1.7.0_101
          -1 unit 20m 24s hadoop-common in the patch failed with JDK v1.7.0_101.
          -1 unit 7m 9s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.7.0_101.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          127m 17s



          Reason Tests
          JDK v1.8.0_91 Failed junit tests hadoop.ipc.TestRPC
            hadoop.ha.TestZKFailoverController
            hadoop.ipc.TestIPC
          JDK v1.8.0_91 Timed out junit tests org.apache.hadoop.conf.TestConfiguration
          JDK v1.7.0_101 Failed junit tests hadoop.ha.TestZKFailoverController
            hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
          JDK v1.7.0_101 Timed out junit tests org.apache.hadoop.conf.TestConfiguration



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:c420dfe
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817183/HADOOP-13362-branch-2.7.patch
          JIRA Issue HADOOP-13362
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 44e073b71333 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.7 / 830516b
          Default Java 1.7.0_101
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101
          findbugs v3.0.0
          findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.7.0_101.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.7.0_101.txt
          JDK v1.7.0_101 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 11m 3s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. 0 mvndep 0m 22s Maven dependency ordering for branch +1 mvninstall 14m 15s branch-2.7 passed +1 compile 5m 48s branch-2.7 passed with JDK v1.8.0_91 +1 compile 6m 30s branch-2.7 passed with JDK v1.7.0_101 +1 checkstyle 1m 21s branch-2.7 passed +1 mvnsite 1m 18s branch-2.7 passed +1 mvneclipse 0m 29s branch-2.7 passed -1 findbugs 1m 41s hadoop-common-project/hadoop-common in branch-2.7 has 3 extant Findbugs warnings. +1 javadoc 1m 7s branch-2.7 passed with JDK v1.8.0_91 +1 javadoc 1m 20s branch-2.7 passed with JDK v1.7.0_101 0 mvndep 0m 13s Maven dependency ordering for patch +1 mvninstall 1m 1s the patch passed +1 compile 7m 1s the patch passed with JDK v1.8.0_91 +1 javac 7m 1s the patch passed +1 compile 6m 13s the patch passed with JDK v1.7.0_101 +1 javac 6m 13s the patch passed +1 checkstyle 1m 17s the patch passed +1 mvnsite 1m 17s the patch passed +1 mvneclipse 0m 25s the patch passed -1 whitespace 0m 1s The patch has 2857 line(s) that end in whitespace. Use git apply --whitespace=fix. -1 whitespace 1m 33s The patch 124 line(s) with tabs. +1 findbugs 2m 49s the patch passed +1 javadoc 1m 15s the patch passed with JDK v1.8.0_91 +1 javadoc 1m 22s the patch passed with JDK v1.7.0_101 -1 unit 20m 24s hadoop-common in the patch failed with JDK v1.7.0_101. -1 unit 7m 9s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.7.0_101. +1 asflicense 0m 21s The patch does not generate ASF License warnings. 127m 17s Reason Tests JDK v1.8.0_91 Failed junit tests hadoop.ipc.TestRPC   hadoop.ha.TestZKFailoverController   hadoop.ipc.TestIPC JDK v1.8.0_91 Timed out junit tests org.apache.hadoop.conf.TestConfiguration JDK v1.7.0_101 Failed junit tests hadoop.ha.TestZKFailoverController   hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService JDK v1.7.0_101 Timed out junit tests org.apache.hadoop.conf.TestConfiguration Subsystem Report/Notes Docker Image:yetus/hadoop:c420dfe JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12817183/HADOOP-13362-branch-2.7.patch JIRA Issue HADOOP-13362 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 44e073b71333 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2.7 / 830516b Default Java 1.7.0_101 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101 findbugs v3.0.0 findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/whitespace-eol.txt whitespace https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/whitespace-tabs.txt unit https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.7.0_101.txt unit https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.7.0_101.txt JDK v1.7.0_101 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/testReport/ modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: . Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/9956/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
          Hide
          jlowe Jason Lowe added a comment -

          +1 lgtm. Verified the test in TestContainerMetrics fails without the change in hadoop-common and passes afterwards. The other test failures appear to be unrelated, and the tests pass for me locally with the patch applied.

          Committing this.

          Show
          jlowe Jason Lowe added a comment - +1 lgtm. Verified the test in TestContainerMetrics fails without the change in hadoop-common and passes afterwards. The other test failures appear to be unrelated, and the tests pass for me locally with the patch applied. Committing this.
          Hide
          jlowe Jason Lowe added a comment -

          Thanks, Junping! I committed this to branch-2.7.

          Show
          jlowe Jason Lowe added a comment - Thanks, Junping! I committed this to branch-2.7.
          Hide
          djp Junping Du added a comment -

          Thanks Jason Lowe for review and commit!

          Show
          djp Junping Du added a comment - Thanks Jason Lowe for review and commit!
          Hide
          djp Junping Du added a comment -

          Hi Jason Lowe, forget to ask, any reason to not putting this patch to 2.7.3?

          Show
          djp Junping Du added a comment - Hi Jason Lowe , forget to ask, any reason to not putting this patch to 2.7.3?
          Hide
          jlowe Jason Lowe added a comment -

          any reason to not putting this patch to 2.7.3?

          2.7.3 was in lockdown for only critical blockers that were preventing it from being released, so I just committed it to branch-2.7. I'm OK with it going into 2.7.3 if there's still a chance to do so, assuming the 2.7.3 release manager is also OK with it.

          Show
          jlowe Jason Lowe added a comment - any reason to not putting this patch to 2.7.3? 2.7.3 was in lockdown for only critical blockers that were preventing it from being released, so I just committed it to branch-2.7. I'm OK with it going into 2.7.3 if there's still a chance to do so, assuming the 2.7.3 release manager is also OK with it.
          Hide
          djp Junping Du added a comment -

          2.7.3 was in lockdown for only critical blockers that were preventing it from being released, so I just committed it to branch-2.7.

          I see. We do see this happen a lot recently in some clusters with enabling container metrics. So I wish we could have it in 2.7.3.

          I'm OK with it going into 2.7.3 if there's still a chance to do so, assuming the 2.7.3 release manager is also OK with it.

          Agree. Even this should not block 2.7.3 release, but we should try to get it in if by any chance. Vinod Kumar Vavilapalli, I saw you leave comments in 2.7.3-rc0 voting thread that rc0 will be withdrawn. Can you confirm this patch can be backport to 2.7.3? Thanks!

          Show
          djp Junping Du added a comment - 2.7.3 was in lockdown for only critical blockers that were preventing it from being released, so I just committed it to branch-2.7. I see. We do see this happen a lot recently in some clusters with enabling container metrics. So I wish we could have it in 2.7.3. I'm OK with it going into 2.7.3 if there's still a chance to do so, assuming the 2.7.3 release manager is also OK with it. Agree. Even this should not block 2.7.3 release, but we should try to get it in if by any chance. Vinod Kumar Vavilapalli , I saw you leave comments in 2.7.3-rc0 voting thread that rc0 will be withdrawn. Can you confirm this patch can be backport to 2.7.3? Thanks!
          Hide
          djp Junping Du added a comment -

          It sounds like we are missing this in 2.8 branch. Reopen this for backport to 2.8.1.

          Show
          djp Junping Du added a comment - It sounds like we are missing this in 2.8 branch. Reopen this for backport to 2.8.1.
          Hide
          djp Junping Du added a comment -

          I forget we have a different patch - YARN-5190 for 2.8 and after. Resolve it.

          Show
          djp Junping Du added a comment - I forget we have a different patch - YARN-5190 for 2.8 and after. Resolve it.

            People

            • Assignee:
              djp Junping Du
              Reporter:
              jlowe Jason Lowe
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development