I'd like to propose a fix for a problem where the Hadoop configuration is not localized when a job is submitted in yarn-cluster mode. Here is the description from the GitHub pull request https://github.com/apache/spark/pull/1574:
This patch fixes a problem where, when the Spark driver runs in a container
managed by the YARN ResourceManager, it inherits the Hadoop configuration
from the NodeManager process, which can differ from the Hadoop
configuration present on the client (the submitting machine). The problem is
most visible when the fs.defaultFS property differs between the two.
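To illustrate the mismatch, here is a minimal sketch (not code from the patch) showing how "new Configuration()" picks up whatever *-site.xml files are visible to the JVM that runs it:

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative only: "new Configuration()" loads the *-site.xml files visible
// to the current JVM, so the value printed here depends on where the code runs
// (the client machine vs. a NodeManager host in the cluster).
val conf = new Configuration()
println(conf.get("fs.defaultFS"))
```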
Hadoop MR solves this by serializing the client's Hadoop configuration into
job.xml in the application staging directory and then making the Application
Master use it. That guarantees that, regardless of the configuration on the
execution nodes, all application containers use the same configuration as
the client.
This patch uses a similar approach. The YARN ClientBase serializes the
configuration and adds it to the ClientDistributedCacheManager under the
"job.xml" link name. The ClientDistributedCacheManager then uses the
Hadoop localizer to deliver it to every container started by the
application, including the one running the Spark driver.
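As a rough sketch of what I mean (the method and variable names below are illustrative, not the actual ClientBase code): the client-side Configuration is written out as job.xml in the staging directory, and that file is what gets registered for localization.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: serialize the client configuration into the application
// staging directory so it can be registered under the "job.xml" link name.
def distributeClientConf(conf: Configuration, stagingDir: Path): Path = {
  val jobXml = new Path(stagingDir, "job.xml")
  val out = FileSystem.get(conf).create(jobXml)
  try conf.writeXml(out) finally out.close()   // writes all non-default properties
  // In the patch, this path is then handed to ClientDistributedCacheManager,
  // so YARN's localizer copies job.xml into every container of the application.
  jobXml
}
```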
The YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM
container request. SparkHadoopUtil.newConfiguration then uses it to trigger
the new behavior, in which the machine-wide Hadoop configuration is merged
with the application-specific job.xml (exactly how it is done in Hadoop MR).
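The merge step boils down to something like the following (the env variable name comes from the patch; the method body is my approximation, not the exact SparkHadoopUtil code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Approximation of the merge: start from the machine-wide configuration and,
// when the flag is set, overlay the job.xml that YARN localized into the
// container's working directory. Later resources override earlier defaults.
def newConfiguration(): Configuration = {
  val conf = new Configuration()
  if (sys.env.contains("SPARK_LOCAL_HADOOPCONF")) {
    conf.addResource(new Path("job.xml"))
  }
  conf
}
```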
SparkContext then follows the same approach, adding the
SPARK_LOCAL_HADOOPCONF env variable to all spawned containers so that they
use the client-side Hadoop configuration as well.
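In spirit, the propagation is just forwarding the flag into the env map used when launching executor containers (this fragment is illustrative, not the actual SparkContext code):

```scala
import scala.collection.mutable

// Illustrative: the env map handed to executor container launch contexts.
val executorEnvs = mutable.HashMap[String, String]()
for (v <- sys.env.get("SPARK_LOCAL_HADOOPCONF")) {
  executorEnvs("SPARK_LOCAL_HADOOPCONF") = v
}
```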
Also, all references to "new Configuration()" that might be executed on the
YARN cluster side are changed to use SparkHadoopUtil.get.conf.
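Within Spark's own code, the change is essentially:

```scala
import org.apache.spark.deploy.SparkHadoopUtil

// Before (picks up whatever *-site.xml happens to be on the local node):
// val hadoopConf = new Configuration()

// After (goes through the merged, client-consistent configuration):
val hadoopConf = SparkHadoopUtil.get.conf
```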
Please note that this fixes only core Spark, the part I am comfortable
testing and verifying. I didn't descend into the streaming/shark
directories, so things might need to be changed there too.