Description
I can successfully run a pyspark program against the yarn-client master with the following command:
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
test1.py
However, the SPARK_YARN_APP_JAR setting doesn't make any sense: this is a Python program, so there is no JAR. If I don't set the variable, or if I set it to a non-existent file, Spark gives me one of the following errors:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
or
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:dummy.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
My program is very simple:
from pyspark import SparkContext

def main():
    sc = SparkContext("yarn-client", "Simple App")
    logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
    numjpgs = logData.filter(lambda s: '.jpg' in s).count()
    print "Number of JPG requests: " + str(numjpgs)

if __name__ == "__main__":
    main()
Although Spark reads the SPARK_YARN_APP_JAR file, it doesn't actually use it at all; I can point it at anything, as long as it's a valid, accessible file, and the program works the same.
Although there's an obvious workaround for this bug, it's high priority from my perspective because I'm working on a course to teach people how to do this, and it's really hard to explain why this variable is needed!
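To make the workaround explicit: the variable just has to point at some readable file, so an invocation along the following lines works (dummy.txt here is only a placeholder; any accessible file will do):

# create an empty placeholder file; its contents are never used
touch ~/dummy.txt
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=~/dummy.txt pyspark \
test1.py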
Issue Links
- is related to: SPARK-1053 Should not require SPARK_YARN_APP_JAR when running on YARN (Resolved)