Description
I can successfully run a pyspark program against the yarn-client master with the following command:
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
test1.py
However, the SPARK_YARN_APP_JAR setting doesn't make any sense: this is a Python program, so there is no JAR. If I don't set the variable, or if I set it to a non-existent file, Spark gives me one of the following errors:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
or
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:dummy.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
My program is very simple:
from pyspark import SparkContext

def main():
    sc = SparkContext("yarn-client", "Simple App")
    logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
    numjpgs = logData.filter(lambda s: '.jpg' in s).count()
    print "Number of JPG requests: " + str(numjpgs)

if __name__ == "__main__":
    main()
Although Spark reads the SPARK_YARN_APP_JAR file, it doesn't actually use it at all; I can point it at anything, as long as it's a valid, accessible file, and the program works the same.
Although there's an obvious workaround for this bug, it's high priority from my perspective because I'm working on a course to teach people how to do this, and it's really hard to explain why this variable is needed!
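To make the workaround explicit: the variable just has to point at some readable file, so an invocation along the following lines works (dummy.txt here is only a placeholder; any accessible file will do):

# create an empty placeholder file; its contents are never used
touch ~/dummy.txt
SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=~/dummy.txt pyspark \
test1.py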
Issue Links
- is related to: SPARK-1053 Should not require SPARK_YARN_APP_JAR when running on YARN (Resolved)