Spark / SPARK-1030

unneeded file required when running pyspark program using yarn-client

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1
    • Fix Version/s: 1.0.0
    • Component/s: Deploy, PySpark, YARN
    • Labels: None

    Description

      I can successfully run a pyspark program against the yarn-client master with the following command:

      SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar \
      SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
      test1.py
      

      However, setting SPARK_YARN_APP_JAR doesn't make any sense here: this is a Python program, so there is no JAR. If I don't set the variable, or if I set it to a nonexistent file, Spark gives me an error message:

      py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
      : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
      	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
      

      or

      py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
      : java.io.FileNotFoundException: File file:dummy.txt does not exist
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
      

      My program is very simple:

      from pyspark import SparkContext

      def main():
          # Connect to YARN in client mode
          sc = SparkContext("yarn-client", "Simple App")
          logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
          # Count the log lines that mention a .jpg file
          numjpgs = logData.filter(lambda s: '.jpg' in s).count()
          print("Number of JPG requests: " + str(numjpgs))

      if __name__ == "__main__":
          main()
      

      Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at all; I can point it at anything, as long as it's a valid, accessible file, and it works the same.

      Although there's an obvious workaround for this bug, it's high priority from my perspective because I'm working on a course to teach people how to do this, and it's really hard to explain why this variable is needed!
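
      The workaround mentioned above (pointing SPARK_YARN_APP_JAR at any existing, readable file) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: ensure_placeholder_app_jar is a hypothetical helper, not part of PySpark, and it assumes the variable has to be set in the environment before PySpark launches its JVM, i.e. before the SparkContext is created.

```python
import os
import tempfile


def ensure_placeholder_app_jar(env=None):
    """Make SPARK_YARN_APP_JAR point at an existing file (hypothetical helper).

    If the variable is unset or names a missing file, create a small
    placeholder file and point the variable at it. Returns the path that
    SPARK_YARN_APP_JAR refers to afterwards.
    """
    if env is None:
        env = os.environ

    current = env.get("SPARK_YARN_APP_JAR")
    if current and os.path.isfile(current):
        # Already names an existing file; leave it alone.
        return current

    # Assumption (from the report): the file's contents are never used,
    # so any small readable file should do.
    fd, path = tempfile.mkstemp(prefix="spark_yarn_placeholder_", suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write("placeholder for SPARK_YARN_APP_JAR\n")

    env["SPARK_YARN_APP_JAR"] = path
    return path
```

      A script would call ensure_placeholder_app_jar() before constructing SparkContext("yarn-client", ...). The helper only hides the requirement; it does not remove it, which is what this issue asks for.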

            People

              Assignee: Josh Rosen (joshrosen)
              Reporter: Diana Carroll (dcarroll@cloudera.com)