Hadoop YARN / YARN-10323 [Umbrella] YARN Diagnostic collector / YARN-10264

Add container launch related env / classpath debug info to container logs when a container fails

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      When a container fails to launch, it can be hard to figure out why.

      Similar to YARN-4309, we can add a switch that controls whether the environment variables and the Java classpath are printed to the container logs.
      As a bonus, jdeps could also be used to print verbose information about the classpath.
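
To illustrate the kind of output such a switch could produce, here is a minimal sketch in Python (the language of the test harness that produced the log excerpt in this issue). It is not the NodeManager implementation: the function name `format_launch_debug_info` and the `enabled` parameter, which stands in for the proposed switch, are hypothetical. It lists the environment variables and marks each classpath entry as present or missing on the local filesystem.

```python
import os


def format_launch_debug_info(env, classpath, enabled=True):
    """Return a text dump of environment variables and classpath entries.

    `enabled` is a stand-in for the proposed switch (hypothetical name).
    """
    if not enabled:
        return ""
    lines = ["Environment variables:"]
    for name in sorted(env):
        lines.append(f"  {name}={env[name]}")
    lines.append("Java classpath entries:")
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        # A trailing '*' is the JVM wildcard for "all jars in this directory",
        # so check the directory itself rather than the literal entry.
        path = (entry[:-1] or ".") if entry.endswith("*") else entry
        status = "ok" if os.path.exists(path) else "MISSING"
        lines.append(f"  [{status}] {entry}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # CLASSPATH is only an example source; a real launch script would pass
    # whatever classpath it computed for the container.
    print(format_launch_debug_info(dict(os.environ),
                                   os.environ.get("CLASSPATH", "")), end="")
```

Flagging missing entries is a design choice of this sketch: a wrong `HADOOP_MAPRED_HOME` typically shows up as classpath entries that point at nonexistent directories.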

      When log aggregation runs, this information is collected automatically, which makes debugging such container launch failures much easier.

      Below is an example output when the user faces a classpath configuration issue while launching an application:

      End of LogType:prelaunch.err
      ******************************************************************************
      2020-04-19 05:49:12,145 DEBUG:app_info:Diagnostics of the failed app
      2020-04-19 05:49:12,145 DEBUG:app_info:Application application_1587300264561_0001 failed 2 times due to AM Container for appattempt_1587300264561_0001_000002 exited with  exitCode: 1
      Failing this attempt.Diagnostics: [2020-04-19 12:45:01.955]Exception from container-launch.
      Container id: container_e60_1587300264561_0001_02_000001
      Exit code: 1
      Exception message: Launch container failed
      Shell output: main : command provided 1
      main : run as user is systest
      main : requested yarn user is systest
      Getting exit code file...
      Creating script paths...
      Writing pid file...
      Writing to tmp file /dataroot/ycloud/yarn/nm/nmPrivate/application_1587300264561_0001/container_e60_1587300264561_0001_02_000001/container_e60_1587300264561_0001_02_000001.pid.tmp
      Writing to cgroup task files...
      Creating local dirs...
      Launching container...
      Getting exit code file...
      Creating script paths...
      
      
      [2020-04-19 12:45:01.984]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
      Last 4096 bytes of prelaunch.err :
      Last 4096 bytes of stderr :
      Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
      
      Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
      <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      
      [2020-04-19 12:45:01.985]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
      Last 4096 bytes of prelaunch.err :
      Last 4096 bytes of stderr :
      Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
      
      Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
      <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
      </property>
      
      For more detailed output, check the application tracking page: http://quasar-plnefj-2.quasar-plnefj.root.hwx.site:8088/cluster/app/application_1587300264561_0001 Then click on links to logs of each attempt.
      ...
      2020-04-19 05:49:12,148 INFO:util:* End test_app_API (yarn.suite.YarnAPITests) *
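
With the classpath available in the logs, a failure like the one above could be checked mechanically. The following is a rough sketch under stated assumptions, not part of YARN: `missing_main_classes` and `jars_providing` are hypothetical helper names. The first extracts the class names from "Could not find or load main class" lines, and the second reports which jars on a given classpath actually contain a class.

```python
import glob
import os
import re
import zipfile

MISSING_CLASS = re.compile(r"Error: Could not find or load main class (\S+)")


def missing_main_classes(diagnostics):
    """Return the distinct class names reported as unloadable, in order."""
    seen = []
    for name in MISSING_CLASS.findall(diagnostics):
        if name not in seen:
            seen.append(name)
    return seen


def jars_providing(class_name, classpath):
    """Return the jars on `classpath` that contain `class_name`."""
    member = class_name.replace(".", "/") + ".class"
    hits = []
    for entry in classpath.split(os.pathsep):
        # Expand the JVM "dir/*" wildcard into the jars in that directory.
        if entry.endswith("*"):
            pattern = os.path.join(entry[:-1] or ".", "*.jar")
            candidates = sorted(glob.glob(pattern))
        else:
            candidates = [entry]
        for jar in candidates:
            # is_zipfile() returns False for directories and missing paths.
            if zipfile.is_zipfile(jar):
                with zipfile.ZipFile(jar) as zf:
                    if member in zf.namelist():
                        hits.append(jar)
    return hits
```

An empty result from `jars_providing` for `org.apache.hadoop.mapreduce.v2.app.MRAppMaster` would point to the classpath misconfiguration that the diagnostics message hints at.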
      

          People

            Assignee: Unassigned
            Reporter: Szilard Nemeth (snemeth)