Hadoop YARN / YARN-7751

Decommissioned NM leaves orphaned containers

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Recently we found some orphaned containers running on a decommissioned NM in our production cluster. The problem began with a PCIe error on this node: one of the local directories became non-writable, so containers whose pid files were located on it could not be cleaned up successfully. A few moments later, the NM changed to the DECOMMISSIONED state and exited.

      Corresponding logs in NM:

      2018-01-12 21:31:38,495 WARN [DiskHealthMonitor-Timer] org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /dump/2/nm-logs error, Directory is not writable: /dump/2/nm-logs, removing from list of valid directories
      
      2018-01-12 21:41:23,352 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e37_1508697357114_216838_01_001812
      2018-01-12 21:41:25,601 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Could not get pid for container_e37_1508697357114_216838_01_001812. Waited for 2000 ms.
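
      One plausible reading of the logs above is that cleanup waits a bounded time for the container's pid file and, if the file never becomes readable, gives up without signalling the process. The following is a minimal Java sketch of that pattern, not the NodeManager's actual ContainerLaunch code; the class name, method names, poll interval, and temp-file paths are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Illustrative sketch only; names and structure are assumptions, not Hadoop code. */
final class PidFileWaitSketch {

    /**
     * Polls for a pid file until it is readable and non-empty, or until the
     * timeout elapses. Returns the pid, or empty if the wait timed out.
     */
    public static Optional<String> waitForPid(Path pidFile, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            try {
                if (Files.isReadable(pidFile)) {
                    String pid = new String(Files.readAllBytes(pidFile),
                            StandardCharsets.UTF_8).trim();
                    if (!pid.isEmpty()) {
                        return Optional.of(pid);
                    }
                }
            } catch (IOException e) {
                // A failing disk can make the read itself fail; treat it like
                // a missing file and keep retrying until the deadline.
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            Thread.sleep(pollMs);
        }
    }

    /**
     * Returns true if a signal would be sent to the process, false if the
     * pid was never obtained and the process is therefore left running.
     */
    public static boolean cleanup(Path pidFile, long timeoutMs) throws InterruptedException {
        Optional<String> pid = waitForPid(pidFile, timeoutMs, 100);
        if (!pid.isPresent()) {
            System.out.println("Could not get pid. Waited for " + timeoutMs + " ms.");
            return false; // nothing is signalled, so the process may be orphaned
        }
        System.out.println("Would signal process " + pid.get());
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("pid-sketch");
        Path present = dir.resolve("present.pid");
        Files.write(present, "12345\n".getBytes(StandardCharsets.UTF_8));
        cleanup(dir.resolve("missing.pid"), 300); // times out, process untouched
        cleanup(present, 300);                    // pid found
    }
}
```

      In this sketch the timed-out branch simply returns, which is the point where a container process could outlive its NM. Whether the real cleanup path behaves this way on a bad disk is an assumption that would need to be checked against the NodeManager source.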
      
      

       

          People

            Assignee: Unassigned
            Reporter: Tao Yang