[SPARK-17387] Creating SparkContext() from python without spark-submit ignores user conf


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Consider the following scenario: a user runs a python application not through spark-submit, but by adding the pyspark module to PYTHONPATH and manually creating a SparkContext. Something like this:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from pyspark import SparkContext
      >>> from pyspark import SparkConf
      >>> conf = SparkConf().set("spark.driver.memory", "4g")
      >>> sc = SparkContext(conf=conf)
      

      If you look at the JVM launched by the pyspark code, you can see that the user's configuration was ignored:

      $ ps ax | grep $(pgrep -f SparkSubmit)
      12283 pts/2    Sl+    0:03 /apps/java7/bin/java -cp ... -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
      

      Note the "1g" of memory (the -Xmx1g above). If instead you launch through "pyspark", you get the correct "4g" in the JVM.
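
      One way to confirm the mismatch from inside the same Python session (an illustrative check, not part of the original report; sc._jvm is pyspark's internal handle to the Py4J gateway) is to ask the already-running driver JVM for its max heap:

      >>> # The JVM heap reflects the default 1g, not the requested 4g:
      >>> sc._jvm.java.lang.Runtime.getRuntime().maxMemory() / float(1024 ** 2)  # MiB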

      This also affects other configs; for example, you can't really add jars to the driver's classpath using "spark.jars".

      You can work around this by setting the undocumented PYSPARK_SUBMIT_ARGS env variable that Spark itself uses:

      $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
      Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
      [GCC 5.4.0 20160609] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import os
      >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "--conf spark.driver.memory=4g pyspark-shell"
      >>> from pyspark import SparkContext
      >>> sc = SparkContext()
      

      But it would be nicer if the configs were automatically propagated.

      The underlying reason is that the launch_gateway function used to start the JVM does not take any parameters, so the only place it can pick up arguments for spark-submit is that env variable.
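
      As a sketch of what propagating the conf could look like from the user side (illustrative only, not the actual fix; the submit_args helper and the jar path are made up for this example), the settings could be turned into "--conf" arguments before anything touches pyspark:

      # Sketch only: build PYSPARK_SUBMIT_ARGS from a plain dict of settings
      # before pyspark is imported, so the env variable is already in place
      # when the JVM is eventually launched.
      import os

      def submit_args(settings):
          # Options must come before the "pyspark-shell" primary resource.
          opts = ["--conf %s=%s" % (k, v) for k, v in settings.items()]
          return " ".join(opts + ["pyspark-shell"])

      os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args({
          "spark.driver.memory": "4g",
          "spark.jars": "/tmp/my-udfs.jar",  # hypothetical jar path
      })

      from pyspark import SparkContext
      sc = SparkContext()  # spark-submit now sees the --conf options

      The proper fix would be to do something equivalent inside launch_gateway itself, so that a conf passed to SparkContext is forwarded automatically.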

    People

      Assignee: Jeff Zhang (zjffdu)
      Reporter: Marcelo Masiero Vanzin (vanzin)
      Votes: 0
      Watchers: 5
