Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.3.3
- Fix Version/s: None
Description
The documentation says that "In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file." However, when spark.local.dir is set through --conf with spark-submit, the application still uses the value from ${SPARK_HOME}/conf/spark-defaults.conf. What's more, the Spark Web UI's Environment tab shows the value from --conf, which is really misleading.
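The documented precedence order can be sketched as a simple lookup chain. This is a minimal illustration of the rule quoted above, not Spark's actual merge code; the keys and values are taken from this report:

```python
from collections import ChainMap

# Documented precedence, highest first: values set explicitly on a
# SparkConf, then flags passed to spark-submit (--conf), then the
# defaults file. ChainMap consults the maps in order and returns the
# value from the first map that defines the key.
spark_conf   = {}  # nothing set programmatically in this report
submit_flags = {"spark.local.dir": "/tmp/spark_local"}        # --conf
defaults     = {"spark.local.dir": "/mnt/nvme1/spark_local"}  # spark-defaults.conf

effective = ChainMap(spark_conf, submit_flags, defaults)

# Per the documentation, the --conf value should win here:
print(effective["spark.local.dir"])  # /tmp/spark_local
```

Under this rule, the shuffle data in the scenario below should have gone to /tmp/spark_local, which is what makes the observed behavior a bug report.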
For example, I submitted my application with the following command:
/opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf spark.local.dir=/tmp/spark_local -v --class org.apache.spark.examples.mllib.SparseNaiveBayes --master spark://bdw-slave20:7077 /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar hdfs://bdw-slave20:8020/Bayes/Input
In ${SPARK_HOME}/conf/spark-defaults.conf, spark.local.dir is set to:
spark.local.dir=/mnt/nvme1/spark_local
While the application was running, I found that the intermediate shuffle data was written to /mnt/nvme1/spark_local, the directory set in ${SPARK_HOME}/conf/spark-defaults.conf, yet the Web UI shows the environment value spark.local.dir=/tmp/spark_local.
The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, which is misleading.
spark-submit verbose output:
XXXX
Spark properties used, including those specified through
--conf and those from the properties file /opt/spark.conf:
(spark.local.dir,/tmp/spark_local)
(spark.default.parallelism,132)
(spark.driver.memory,10g)
(spark.executor.memory,352g)
XXXXX
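One way to confirm which directory Spark actually used, independently of the Web UI and the verbose output, is to watch which candidate directory accumulates scratch files while the job is shuffling. A minimal sketch, assuming the two candidate paths from this report (the helper names are mine, not Spark's):

```python
import os

def files_under(root):
    """Count regular files anywhere under root (0 if it doesn't exist)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

def busiest_dir(candidates):
    """Return the candidate directory currently holding the most files."""
    return max(candidates, key=files_under)

# Hypothetical check: run while the job is shuffling. In this report,
# busiest_dir would point at /mnt/nvme1/spark_local, contradicting the
# spark.local.dir=/tmp/spark_local shown by the UI and verbose output.
candidates = ["/tmp/spark_local", "/mnt/nvme1/spark_local"]
```

This is how the mismatch was observed here: the data landed in the defaults-file directory even though both reporting channels claimed the --conf value was in effect.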