[SPARK-17387] Creating SparkContext() from python without spark-submit ignores user conf


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Consider the following scenario: a user runs a python application not through spark-submit, but by adding the pyspark module to PYTHONPATH and manually creating a SparkContext. Something like this:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from pyspark import SparkContext
      >>> from pyspark import SparkConf
      >>> conf = SparkConf().set("spark.driver.memory", "4g")
      >>> sc = SparkContext(conf=conf)
      

      If you look at the JVM launched by the pyspark code, you can see that the user's configuration was ignored:

      $ ps ax | grep $(pgrep -f SparkSubmit)
      12283 pts/2    Sl+    0:03 /apps/java7/bin/java -cp ... -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
      

      Note the "1g" of memory (the -Xmx1g above). If instead you launch through "pyspark", you get the correct "4g" in the JVM.
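
      One way to confirm the mismatch from inside the same Python session (an illustrative check, not part of the original report; sc._jvm is pyspark's internal handle to the Py4J gateway) is to ask the already-running driver JVM for its max heap:

      >>> # The JVM heap reflects the default 1g, not the requested 4g:
      >>> sc._jvm.java.lang.Runtime.getRuntime().maxMemory() / float(1024 ** 2)  # MiB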

      This also affects other configs; for example, you can't really add jars to the driver's classpath using "spark.jars".

      You can work around this by setting the undocumented PYSPARK_SUBMIT_ARGS env variable that Spark itself uses:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import os
      >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=4g pyspark-shell"
      >>> from pyspark import SparkContext
      >>> sc = SparkContext()
      

      But it would be nicer if the configs were automatically propagated.

      The underlying reason is that the launch_gateway function used to start the JVM does not take any parameters, so the only place it can pick up arguments for spark-submit is that env variable.
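
      As a sketch of what propagating the conf could look like from the user side (illustrative only, not the actual fix; the submit_args helper and the jar path are made up for this example), the settings could be turned into "--conf" arguments before anything touches pyspark:

      # Sketch only: build PYSPARK_SUBMIT_ARGS from a plain dict of settings
      # before pyspark is imported, so the env variable is already in place
      # when the JVM is eventually launched.
      import os

      def submit_args(settings):
          # Options must come before the "pyspark-shell" primary resource.
          opts = ["--conf %s=%s" % (k, v) for k, v in settings.items()]
          return " ".join(opts + ["pyspark-shell"])

      os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args({
          "spark.driver.memory": "4g",
          "spark.jars": "/tmp/my-udfs.jar",  # hypothetical jar path
      })

      from pyspark import SparkContext
      sc = SparkContext()  # spark-submit now sees the --conf options

      The proper fix would be to do something equivalent inside launch_gateway itself, so that a conf passed to SparkContext is forwarded automatically.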

    People

      Assignee: Jeff Zhang (zjffdu)
      Reporter: Marcelo Masiero Vanzin (vanzin)
      Votes: 0
      Watchers: 5
