  1. Hadoop YARN
  2. YARN-8035

Uncaught exception in ContainersMonitorImpl during relaunch due to the process ID changing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 3.2.0, 3.3.0
    • None
    • None
    • Reviewed

    Description

      In the case of a container relaunch event, the container ID is reused but a new process is spawned. For resource monitoring, ContainersMonitorImpl obtains the new PID after the relaunch and initializes process tree monitoring. As part of this initialization, a tag named ContainerPid, whose value is the container's PID, is set on the metrics associated with the container. If the prior launch failed after its process started, the tag already holds the original PID, and setting it again results in the MetricsException below.

      2018-03-16 11:59:02,563 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Uncaught exception in ContainersMonitorImpl while monitoring resource of container_1521201379995_0001_01_000002
      org.apache.hadoop.metrics2.MetricsException: Tag ContainerPid already exists!
      at org.apache.hadoop.metrics2.lib.MetricsRegistry.checkTagName(MetricsRegistry.java:433)
      at org.apache.hadoop.metrics2.lib.MetricsRegistry.tag(MetricsRegistry.java:394)
      at org.apache.hadoop.metrics2.lib.MetricsRegistry.tag(MetricsRegistry.java:400)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.recordProcessId(ContainerMetrics.java:277)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.initializeProcessTrees(ContainersMonitorImpl.java:559)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:448)
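
The failure mode in the trace above can be sketched with a small model. This is not Hadoop code: `StrictTagRegistry` and `RelaunchFailureDemo` are hypothetical stand-ins, and the PIDs are made up. The sketch assumes only what the trace shows, namely that registering a tag name a second time is rejected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metrics registry; NOT the Hadoop MetricsRegistry class.
final class StrictTagRegistry {
    private final Map<String, String> tags = new LinkedHashMap<>();

    // Registers a tag and rejects any name that is already present.
    void tag(String name, String value) {
        if (tags.containsKey(name)) {
            throw new IllegalStateException("Tag " + name + " already exists!");
        }
        tags.put(name, value);
    }

    String get(String name) {
        return tags.get(name);
    }
}

final class RelaunchFailureDemo {
    // Records the PID for two launches of the same container and returns the
    // error message raised by the second attempt, or null if it succeeded.
    static String relaunch(StrictTagRegistry registry, String firstPid, String secondPid) {
        registry.tag("ContainerPid", firstPid);      // initial launch
        try {
            registry.tag("ContainerPid", secondPid); // relaunch reuses the container ID
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        StrictTagRegistry registry = new StrictTagRegistry();
        System.out.println(relaunch(registry, "4242", "4711"));
        System.out.println(registry.get("ContainerPid")); // still the stale PID
    }
}
```

In this model the second call fails and the tag keeps the PID of the process that no longer exists.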

      MetricsRegistry provides a tag method that allows updating the value of an existing tag. Updating the value ensures that the PID associated with the container is that of the currently running process, which appears to be an appropriate fix. However, it's unclear how other systems might be using this tag; I have not found any usage in Hadoop itself.
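
A minimal sketch of the proposed approach, again using a hypothetical model rather than the real classes. `OverridableTagRegistry`, `RelaunchOverrideDemo`, and the PIDs are assumptions for illustration. The boolean `override` parameter models the update-capable tag method the description refers to; the attached patches may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metrics registry; NOT the Hadoop MetricsRegistry class.
final class OverridableTagRegistry {
    private final Map<String, String> tags = new LinkedHashMap<>();

    // Registers a tag. An existing name is rejected unless override is true,
    // in which case the stored value is replaced.
    void tag(String name, String value, boolean override) {
        if (!override && tags.containsKey(name)) {
            throw new IllegalStateException("Tag " + name + " already exists!");
        }
        tags.put(name, value);
    }

    String get(String name) {
        return tags.get(name);
    }
}

final class RelaunchOverrideDemo {
    // Models recording the container PID so that a relaunch updates the tag
    // instead of failing on the duplicate name.
    static void recordProcessId(OverridableTagRegistry registry, String pid) {
        registry.tag("ContainerPid", pid, true);
    }

    public static void main(String[] args) {
        OverridableTagRegistry registry = new OverridableTagRegistry();
        recordProcessId(registry, "4242"); // initial launch
        recordProcessId(registry, "4711"); // relaunch with a new process
        System.out.println(registry.get("ContainerPid"));
    }
}
```

The trade-off noted in the description still applies: any external consumer that assumes the tag is immutable for the life of a container ID would see the value change after a relaunch.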

      Attachments

        1. YARN-8035.002.patch
          3 kB
          Shane Kumpf
        2. YARN-8035.001.patch
          3 kB
          Shane Kumpf

            People

              shanekumpf@gmail.com Shane Kumpf