Spark / SPARK-17387

Creating SparkContext() from python without spark-submit ignores user conf


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: PySpark
    • Labels: None

      Description

      Consider the following scenario: the user runs a python application not through spark-submit, but by adding the pyspark module to PYTHONPATH and manually creating a SparkContext. Kinda like this:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from pyspark import SparkContext
      >>> from pyspark import SparkConf
      >>> conf = SparkConf().set("spark.driver.memory", "4g")
      >>> sc = SparkContext(conf=conf)
      

      If you look at the JVM launched by the pyspark code, it ignores the user's configuration:

      $ ps ax | grep $(pgrep -f SparkSubmit)
      12283 pts/2    Sl+    0:03 /apps/java7/bin/java -cp ... -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
      

      Note the "1g" of memory. If instead you use "pyspark", you get the correct "4g" in the JVM.

      This also affects other configs; for example, you can't really add jars to the driver's classpath using "spark.jars".
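
      To illustrate (the jar path below is made up, purely for illustration), the same thing happens if you set "spark.jars" on the SparkConf before creating the context; the JVM is launched without ever seeing that setting, so the jar never makes it onto the driver's classpath:

      >>> from pyspark import SparkConf, SparkContext
      >>> # Hypothetical jar path, for illustration only.
      >>> conf = SparkConf().set("spark.jars", "/tmp/my-lib.jar")
      >>> sc = SparkContext(conf=conf)
      >>> # The driver JVM was started by launch_gateway without seeing this conf,
      >>> # so /tmp/my-lib.jar is not on its classpath.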

      You can work around this by setting the undocumented env variable Spark itself uses:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import os
      >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=4g pyspark-shell"
      >>> from pyspark import SparkContext
      >>> sc = SparkContext()
      

      But it would be nicer if the configs were automatically propagated.

      The reason for this is that the launch_gateway function used to start the JVM does not take any parameters, so the only place it can pick up Spark arguments from is that env variable.
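
      For what it's worth, one way the propagation could work (a rough sketch only, not necessarily what the actual fix looks like; the helper name below is made up) is to turn the user's SparkConf entries into "--conf" arguments and prepend them to PYSPARK_SUBMIT_ARGS before launch_gateway starts the JVM:

      import os

      def _conf_to_submit_args(conf):
          # Hypothetical helper: turn every entry of the user's SparkConf into a
          # "--conf key=value" pair for spark-submit, keeping the trailing
          # "pyspark-shell" that launch_gateway expects as the primary resource.
          pairs = ["--conf %s=%s" % (k, v) for (k, v) in conf.getAll()]
          existing = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
          return " ".join(pairs + [existing])

      # SparkContext could then do something like this before launching the JVM:
      #   os.environ["PYSPARK_SUBMIT_ARGS"] = _conf_to_submit_args(conf)
      #   gateway = launch_gateway()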

    People

    • Assignee: zjffdu Jeff Zhang
    • Reporter: vanzin Marcelo Vanzin
    • Votes: 0
    • Watchers: 6

    Dates

    • Created:
    • Updated:
    • Resolved: