Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-5190

Registering/unregistering container metrics triggered by ContainerEvent and ContainersMonitorEvent are conflict which cause uncaught exception in ContainerMonitorImpl

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The exception stack is as following:

      310735 2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Exception.
      310736 org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010 already exists!
      310737         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
      310738         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
      310739         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
      310740         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
      310741         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
      310742         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
      

      After YARN-4906, we have multiple places to get ContainerMetrics for a particular container that could cause race condition in registering the same container metrics to DefaultMetricsSystem by different threads. Lacking of proper handling of MetricsException which could get thrown, the exception will could bring down daemon of ContainerMonitorImpl or even whole NM.

        Attachments

        1. YARN-5190-branch-2.7.001.patch
          3 kB
          Wangda Tan
        2. YARN-5190-v2.patch
          7 kB
          Junping Du
        3. YARN-5190.patch
          8 kB
          Junping Du

          Issue Links

            Activity

              People

              • Assignee:
                djp Junping Du
                Reporter:
                djp Junping Du
              • Votes:
                0 Vote for this issue
                Watchers:
                13 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: