SPARK-25869: Spark on YARN: the original diagnostics are missing when a job fails maxAppAttempts times


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: Spark Core, YARN

    Description

      When running Spark on YARN, I submitted a job using the command below:

      
       spark-submit --class org.apache.spark.examples.SparkPi \
         --master yarn \
         --deploy-mode cluster \
         --driver-memory 127m \
         --driver-cores 1 \
         --executor-memory 2048m \
         --executor-cores 1 \
         --num-executors 10 \
         --queue root.mr \
         --conf spark.testing.reservedMemory=1048576 \
         --conf spark.yarn.executor.memoryOverhead=50 \
         --conf spark.yarn.driver.memoryOverhead=50 \
         /opt/ZDH/parcels/lib/spark/examples/jars/spark-examples* 10000
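      For scale: in cluster mode, the ApplicationMaster container that hosts the driver is sized from --driver-memory plus spark.yarn.driver.memoryOverhead. A back-of-the-envelope check of the flags above (a sketch only; Spark's real sizing also rounds up to YARN's minimum allocation):

      // Rough AM container size implied by the submission above. This ignores
      // rounding to yarn.scheduler.minimum-allocation-mb, so treat it as an
      // estimate, not Spark's exact request.
      object AmSizeEstimate extends App {
        val driverMemoryMb   = 127 // --driver-memory 127m
        val driverOverheadMb = 50  // --conf spark.yarn.driver.memoryOverhead=50
        val amContainerMb    = driverMemoryMb + driverOverheadMb
        println(s"Approximate AM container request: $amContainerMb MB") // ~177 MB
      }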
      
      

      Clearly, the driver memory is not enough, but this cannot be seen in the Spark client log:

      
      2018-10-29 19:28:34,658 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: ACCEPTED)
      2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: RUNNING)
      2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client:
       client token: N/A
       diagnostics: N/A
       ApplicationMaster host: 10.43.183.143
       ApplicationMaster RPC port: 0
       queue: root.mr
       start time: 1540812501560
       final status: UNDEFINED
       tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
       user: mr
      2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: FINISHED)
      2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client:
       client token: N/A
       diagnostics: Shutdown hook called before final status was reported.
       ApplicationMaster host: 10.43.183.143
       ApplicationMaster RPC port: 0
       queue: root.mr
       start time: 1540812501560
       final status: FAILED
       tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
       user: mr
      Exception in thread "main" org.apache.spark.SparkException: Application application_1540536615315_0013 finished with failed status
       at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
       at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
       at org.apache.spark.deploy.yarn.Client.main(Client.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      2018-10-29 19:28:36,694 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
      2018-10-29 19:28:36,695 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-96077be5-0dfa-496d-a6a0-96e83393a8d9
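      Note that the diagnostics shown above, "Shutdown hook called before final status was reported.", is not YARN's own diagnosis: it is the message the ApplicationMaster hands to the ResourceManager when it unregisters, and the string passed there becomes the diagnostics the RM returns in later application reports, replacing what YARN recorded about the failed attempts (it also explains why the application ends in state FINISHED rather than FAILED). A minimal sketch of that YARN mechanism, illustrative only and not Spark's ApplicationMaster code:

      import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
      import org.apache.hadoop.yarn.client.api.AMRMClient
      import org.apache.hadoop.yarn.conf.YarnConfiguration

      // Illustrative only: inside a (previously registered) ApplicationMaster,
      // the appMessage passed to unregisterApplicationMaster is stored by the
      // RM and returned as the application's diagnostics. A generic message
      // here, e.g. from a shutdown hook, hides the container-kill details.
      object UnregisterSketch {
        def unregisterAm(appMessage: String): Unit = {
          val amClient = AMRMClient.createAMRMClient[AMRMClient.ContainerRequest]()
          amClient.init(new YarnConfiguration())
          amClient.start()
          try {
            amClient.unregisterApplicationMaster(
              FinalApplicationStatus.FAILED, appMessage, /* trackingUrl */ "")
          } finally {
            amClient.stop()
          }
        }
      }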
      
      


      Solution: after applying the patch, the Spark client log shows:

      
      2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0012 (state: RUNNING)
      2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client:
       client token: N/A
       diagnostics: N/A
       ApplicationMaster host: 10.43.183.143
       ApplicationMaster RPC port: 0
       queue: root.mr
       start time: 1540812436656
       final status: UNDEFINED
       tracking URL: http://zdh141:8088/proxy/application_1540536615315_0012/
       user: mr
      2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0012 (state: FAILED)
      2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client:
       client token: N/A
       diagnostics: Application application_1540536615315_0012 failed 2 times due to AM Container for appattempt_1540536615315_0012_000002 exited with exitCode: -104
      For more detailed output, check application tracking page:http://zdh141:8088/cluster/app/application_1540536615315_0012Then, click on links to logs of each attempt.
      Diagnostics: virtual memory used. Killing container.
      Dump of the process-tree for container_e53_1540536615315_0012_02_000001 :
       |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
       |- 1532 1528 1528 1528 (java) 1209 174 3472551936 65185 /usr/java/jdk/bin/java -server -Xmx127m -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp -Xss32M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar --arg 10000 --properties-file /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties
       |- 1528 1526 1528 1528 (bash) 0 0 108642304 309 /bin/bash -c LD_LIBRARY_PATH=/opt/ZDH/parcels/lib/hadoop/lib/native: /usr/java/jdk/bin/java -server -Xmx127m -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp '-Xss32M' '-XX:MetaspaceSize=128M' '-XX:MaxMetaspaceSize=512M' -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar --arg '10000' --properties-file /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties 1> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stdout 2> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stderr
      
      Container killed on request. Exit code is 143
      Container exited with a non-zero exit code 143
      PmemUsageMBsMaxMBs is: 255.0 MB
      Failing this attempt. Failing the application.
       ApplicationMaster host: N/A
       ApplicationMaster RPC port: -1
       queue: root.mr
       start time: 1540812436656
       final status: FAILED
       tracking URL: http://zdh141:8088/cluster/app/application_1540536615315_0012
       user: mr
      2018-10-29 19:27:34,542 INFO org.apache.spark.deploy.yarn.Client: Deleted staging directory hdfs://nameservice/user/mr/.sparkStaging/application_1540536615315_0012
      Exception in thread "main" org.apache.spark.SparkException: Application application_1540536615315_0012 finished with failed status
       at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
       at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
       at org.apache.spark.deploy.yarn.Client.main(Client.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      2018-10-29 19:27:34,548 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
      2018-10-29 19:27:34,549 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-ce35f2ad-ec1f-4173-9441-163e2482ed61
      
      

      Now we can see the true reason for the job failure from the client!
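      To confirm from outside spark-submit, the same diagnostics string can be read back from the ResourceManager with the public YARN client API. A minimal sketch, assuming a YarnConfiguration on the classpath that points at this cluster:

      import org.apache.hadoop.yarn.api.records.ApplicationId
      import org.apache.hadoop.yarn.client.api.YarnClient
      import org.apache.hadoop.yarn.conf.YarnConfiguration

      // Minimal sketch: read the application's final report back from the RM
      // and print the fields the client log above is built from.
      object DiagnosticsCheck {
        def printDiagnostics(appId: ApplicationId): Unit = {
          val yarnClient = YarnClient.createYarnClient()
          yarnClient.init(new YarnConfiguration())
          yarnClient.start()
          try {
            val report = yarnClient.getApplicationReport(appId)
            println(s"final status: ${report.getFinalApplicationStatus}")
            println(s"diagnostics: ${Option(report.getDiagnostics).getOrElse("N/A")}")
          } finally {
            yarnClient.stop()
          }
        }
      }

      The same report is also available from the command line via: yarn application -status application_1540536615315_0012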

          People

            Assignee: Unassigned
            Reporter: Yeliang Cang
