Spark / SPARK-9416

Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

    Description

      While running the Spark word count Python example with an intentional mistake in Yarn cluster mode, the Spark terminal logs (Yarn logs) report the final status as SUCCEEDED, but the log files for the Spark application show the correct result, indicating that the job failed.

      The terminal log output and the application log output contradict each other.

      If I run the same job in local mode, the terminal logs and the application logs match: both state that the job failed due to the expected error in the Python script.

      More details: Scenario

      While running the Spark word count Python example in Yarn cluster mode, I make an intentional error in wordcount.py by changing this line (I'm using Spark 1.4.1, but this problem also exists in Spark 1.4.0 and 1.3.0, which I tested):

      lines = sc.textFile(sys.argv[1], 1)

      into this line:

      lines = sc.textFile(nonExistentVariable,1)

      where the variable nonExistentVariable was never created or initialized.
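
      For reference, here is a condensed sketch of the script as it ends up after that change. It paraphrases the stock examples/src/main/python/wordcount.py shipped with Spark, so line numbers and boilerplate may differ slightly from the copy that produced the traceback below, but the broken line is the same:

      from __future__ import print_function

      import sys
      from operator import add

      from pyspark import SparkContext

      if __name__ == "__main__":
          if len(sys.argv) != 2:
              print("Usage: wordcount <file>", file=sys.stderr)
              exit(-1)
          sc = SparkContext(appName="PythonWordCount")
          # Broken on purpose: nonExistentVariable is never defined, so this
          # raises NameError in the driver before any Spark job is submitted.
          lines = sc.textFile(nonExistentVariable, 1)
          counts = lines.flatMap(lambda x: x.split(' ')) \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(add)
          for (word, count) in counts.collect():
              print("%s: %i" % (word, count))
          sc.stop()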

      Then I run that example with this command (I put README.md into HDFS beforehand):

      ./bin/spark-submit --master yarn-cluster wordcount.py /README.md

      The job runs and finishes successfully according to the log printed in the terminal:
      Terminal logs:
      ...
      15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
      15/07/23 16:19:21 INFO yarn.Client:
      client token: N/A
      diagnostics: Shutdown hook called before final status was reported.
      ApplicationMaster host: 10.0.53.59
      ApplicationMaster RPC port: 0
      queue: default
      start time: 1437693551439
      final status: SUCCEEDED
      tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
      user: edadashov
      15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
      15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
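
      This matters for anything that drives spark-submit programmatically: a caller that trusts the exit code (or greps the yarn.Client report for the final status) will treat this run as successful. A minimal sketch of such a hypothetical wrapper, assuming it runs from the Spark installation directory and that spark-submit's exit code follows the reported final status:

      import subprocess
      import sys

      cmd = ["./bin/spark-submit", "--master", "yarn-cluster",
             "wordcount.py", "/README.md"]
      ret = subprocess.call(cmd)

      if ret == 0:
          # Taken here, because the application master reports SUCCEEDED even
          # though the Python driver died with a NameError (see the logs below).
          print("job reported success")
      else:
          print("job reported failure, exit code %d" % ret)
          sys.exit(ret)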

      But if I look at the log files generated for this application in HDFS, they indicate failure of the job with the correct reason:
      Application log files:
      ...
      stdout:
      Traceback (most recent call last):
      File "wordcount.py", line 32, in <module>
      lines = sc.textFile(nonExistentVariable,1)
      NameError: name 'nonExistentVariable' is not defined

      The terminal output (Yarn logs) reporting final status: SUCCEEDED does not match the application logs, which show that the job failed (NameError: name 'nonExistentVariable' is not defined).
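
      Until the final status is reported correctly, one possible stop-gap (not something this ticket proposes, just an illustration) is to cross-check the aggregated application logs for a Python traceback instead of trusting the status alone. This assumes YARN log aggregation is enabled and the yarn CLI is on the PATH:

      import subprocess

      def python_driver_failed(app_id):
          # Hypothetical check: fetch the aggregated logs for the application
          # and look for a Python traceback in them.
          logs = subprocess.check_output(["yarn", "logs", "-applicationId", app_id])
          return b"Traceback (most recent call last):" in logs

      if __name__ == "__main__":
          app_id = "application_1437612288327_0013"  # taken from the terminal output above
          if python_driver_failed(app_id):
              print("%s actually failed despite its SUCCEEDED final status" % app_id)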


    People

      Assignee: Unassigned
      Reporter: Elkhan Dadashov (edadashov)
      Marcelo Masiero Vanzin
      Votes: 0
      Watchers: 2
