  1. Hadoop YARN
  2. YARN-8786

LinuxContainerExecutor fails sporadically in create_local_dirs

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We recently started using CGroups with LinuxContainerExecutor on Apache Hadoop 3.0.0. Occasionally (roughly once in many millions of tasks), a YARN container fails with output like the following:

      [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: Container container_1530684675517_516620_01_020846 transitioned from SCHEDULED to RUNNING
      [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1530684675517_516620_01_020846
      [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 35. Privileged Execution Operation Stderr:
      [2018-09-02 23:48:02.506159] Could not create container dirsCould not create local files and directories
      [2018-09-02 23:48:02.506220]
      [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
      [2018-09-02 23:48:02.506258] main : run as user is nobody
      [2018-09-02 23:48:02.506282] main : requested yarn user is root
      [2018-09-02 23:48:02.506294] Getting exit code file...
      [2018-09-02 23:48:02.506307] Creating script paths...
      [2018-09-02 23:48:02.506330] Writing pid file...
      [2018-09-02 23:48:02.506366] Writing to tmp file /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
      [2018-09-02 23:48:02.506389] Writing to cgroup task files...
      [2018-09-02 23:48:02.506402] Creating local dirs...
      [2018-09-02 23:48:02.506414] Getting exit code file...
      [2018-09-02 23:48:02.506435] Creating script paths...
      
      

      Looking at the container-executor source, the failure is traceable to the error path here: https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604
      and ultimately to https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672

      The root failure seems to be in the underlying mkdir call, but its exit code and errno are swallowed, so we have no further details. We tend to see this when many containers for the same application start at the same time on one host, and we suspect a race condition between those containers on the directories they share.
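
      To make the suspected race concrete, here is a minimal sketch of a directory-creation helper that tolerates a concurrent creator and reports errno instead of swallowing it. The name mkdir_race_tolerant and its signature are hypothetical; this is not the container-executor's actual code, and it assumes the failure really is a second process creating the same directory between a check and the mkdir call.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper, for illustration only.
 * Creates `path` with mode `perm`. If another process created the same
 * directory first (EEXIST and the path is a directory), that counts as
 * success. On any other failure the errno value is written to `log`.
 * Returns 0 on success, -1 on failure.
 */
static int mkdir_race_tolerant(const char *path, mode_t perm, FILE *log) {
  if (mkdir(path, perm) == 0) {
    return 0;
  }

  /* Save errno before any later call can overwrite it. */
  int saved_errno = errno;

  if (saved_errno == EEXIST) {
    struct stat sb;
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
      /* A concurrent caller won the race; the directory exists. */
      return 0;
    }
  }

  fprintf(log, "mkdir(%s) failed: errno=%d (%s)\n",
          path, saved_errno, strerror(saved_errno));
  return -1;
}
```

      Calling mkdir directly and inspecting errno afterwards avoids a separate existence check before creation, which is the window in which two containers could interfere. Whether that window is the actual cause here is unconfirmed.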

      For example, this is a typical pattern in the audit logs:

      [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root	IP=<> Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1530684675517_559126	CONTAINERID=container_1530684675517_559126_01_012871
      [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO nodemanager.NMAuditLogger: USER=root	IP=<> Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1530684675517_559126	CONTAINERID=container_1530684675517_559126_01_012870
      [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN nodemanager.NMAuditLogger: USER=root	OPERATION=Container Finished - Failed	TARGET=ContainerImpl	RESULT=FAILURE	DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE	APPID=application_1530684675517_559126	CONTAINERID=container_1530684675517_559126_01_012871
      
      

      Two containers for the same application start in quick succession, and one of them then ends in EXITED_WITH_FAILURE (exit code 35).

      We plan to upgrade to 3.1.x soon, but I don't expect the upgrade to fix this: the only major JIRAs that have affected the executor since 3.0.0 seem unrelated (https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8 and https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56).

              People

              • Assignee: Unassigned
              • Reporter: Jon Bender (jonbender-stripe)
              • Votes: 0
              • Watchers: 9
