Spark / SPARK-9416

Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

    Description

      While running the Spark word count Python example with an intentional mistake in Yarn cluster mode, the Spark terminal logs (Yarn logs) report the final status as SUCCEEDED, but the log files for the Spark application show the correct result, indicating that the job failed.

      The terminal log output and the application log output contradict each other.

      If I run the same job in local mode, the terminal logs and the application logs match: both state that the job failed due to the expected error in the Python script.

      More details: Scenario

      While running the Spark word count Python example in Yarn cluster mode, I make an intentional error in wordcount.py by changing this line (I'm using Spark 1.4.1, but this problem also exists in Spark 1.4.0 and 1.3.0, which I tested):

      lines = sc.textFile(sys.argv[1], 1)

      into this line:

      lines = sc.textFile(nonExistentVariable,1)

      where the variable nonExistentVariable was never created or initialized.
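
      For reference, here is a condensed sketch of the script as it ends up after that change. It paraphrases the stock examples/src/main/python/wordcount.py shipped with Spark, so line numbers and boilerplate may differ slightly from the copy that produced the traceback below, but the broken line is the same:

      from __future__ import print_function

      import sys
      from operator import add

      from pyspark import SparkContext

      if __name__ == "__main__":
          if len(sys.argv) != 2:
              print("Usage: wordcount <file>", file=sys.stderr)
              exit(-1)
          sc = SparkContext(appName="PythonWordCount")
          # Broken on purpose: nonExistentVariable is never defined, so this
          # raises NameError in the driver before any Spark job is submitted.
          lines = sc.textFile(nonExistentVariable, 1)
          counts = lines.flatMap(lambda x: x.split(' ')) \
                        .map(lambda x: (x, 1)) \
                        .reduceByKey(add)
          for (word, count) in counts.collect():
              print("%s: %i" % (word, count))
          sc.stop()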

      Then I run that example with this command (I put README.md into HDFS beforehand):

      ./bin/spark-submit --master yarn-cluster wordcount.py /README.md

      The job runs and finishes successfully according to the log printed in the terminal:
      Terminal logs:
      ...
      15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
      15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
      15/07/23 16:19:21 INFO yarn.Client:
      client token: N/A
      diagnostics: Shutdown hook called before final status was reported.
      ApplicationMaster host: 10.0.53.59
      ApplicationMaster RPC port: 0
      queue: default
      start time: 1437693551439
      final status: SUCCEEDED
      tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
      user: edadashov
      15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
      15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
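
      This matters for anything that drives spark-submit programmatically: a caller that trusts the exit code (or greps the yarn.Client report for the final status) will treat this run as successful. A minimal sketch of such a hypothetical wrapper, assuming it runs from the Spark installation directory and that spark-submit's exit code follows the reported final status:

      import subprocess
      import sys

      cmd = ["./bin/spark-submit", "--master", "yarn-cluster",
             "wordcount.py", "/README.md"]
      ret = subprocess.call(cmd)

      if ret == 0:
          # Taken here, because the application master reports SUCCEEDED even
          # though the Python driver died with a NameError (see the logs below).
          print("job reported success")
      else:
          print("job reported failure, exit code %d" % ret)
          sys.exit(ret)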

      But if I look at the log files generated for this application in HDFS, they indicate failure of the job with the correct reason:
      Application log files:
      ...
      stdout:
      Traceback (most recent call last):
      File "wordcount.py", line 32, in <module>
      lines = sc.textFile(nonExistentVariable,1)
      NameError: name 'nonExistentVariable' is not defined

      The terminal output (Yarn logs) reporting final status: SUCCEEDED does not match the application logs, which show that the job failed (NameError: name 'nonExistentVariable' is not defined).
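
      Until the final status is reported correctly, one possible stop-gap (not something this ticket proposes, just an illustration) is to cross-check the aggregated application logs for a Python traceback instead of trusting the status alone. This assumes YARN log aggregation is enabled and the yarn CLI is on the PATH:

      import subprocess

      def python_driver_failed(app_id):
          # Hypothetical check: fetch the aggregated logs for the application
          # and look for a Python traceback in them.
          logs = subprocess.check_output(["yarn", "logs", "-applicationId", app_id])
          return b"Traceback (most recent call last):" in logs

      if __name__ == "__main__":
          app_id = "application_1437612288327_0013"  # taken from the terminal output above
          if python_driver_failed(app_id):
              print("%s actually failed despite its SUCCEEDED final status" % app_id)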


    People

      Assignee: Unassigned
      Reporter: Elkhan Dadashov (edadashov)
      Marcelo Masiero Vanzin
      Votes: 0
      Watchers: 2
