Affects Version/s: None
Fix Version/s: None
Configuration options specified in the Hadoop cluster's *.xml config files can be marked as "final", indicating that they should not be overridden by a client's configuration. Spark appears to be overwriting those options anyway; the symptom is that local proxy settings replace the cluster-side proxy settings. This breaks job submission to a remote, firewalled YARN cluster.
For example, with the configuration below, one should be able to establish a SOCKS proxy via ssh -D to a host that can "see" the cluster, and then submit jobs and run the driver on the local desktop/laptop:
Remote cluster-side core-site.xml:
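As a minimal sketch, assuming the standard Hadoop property hadoop.rpc.socket.factory.class.default is the one being pinned (the exact file contents used for this report are an assumption), the cluster-side entry could look like:

```xml
<!-- Sketch only: assumed cluster-side core-site.xml entry. -->
<configuration>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <!-- Plain, non-proxied sockets for node-to-node traffic. -->
    <value>org.apache.hadoop.net.StandardSocketFactory</value>
    <!-- "final" asks Hadoop to ignore later overrides from client config. -->
    <final>true</final>
  </property>
</configuration>
```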
This configuration ensures that the nodes within the cluster never use a proxy to talk to each other.
Local client-side core-site.xml:
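A matching client-side sketch, assuming the SOCKS proxy opened by ssh -D listens on localhost:1080 (the port is a hypothetical choice, not taken from this report):

```xml
<!-- Sketch only: assumed client-side core-site.xml entries. -->
<configuration>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <!-- Route the client's Hadoop RPC traffic through a SOCKS proxy. -->
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <!-- Hypothetical local endpoint of the ssh -D tunnel. -->
    <value>localhost:1080</value>
  </property>
</configuration>
```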
Indeed, running a standard MapReduce job, the log files show that an override of a property marked <final> is attempted:
and the MR job proceeds and finishes normally.
On the other hand, a Spark job with the same configuration shows no such message; instead, we see that the nodes within the cluster are unable to communicate:
Running tcpdump on the slave nodes shows that, in the case of the MR job, packets are sent between the slave nodes and the ResourceManager node, indicating that no proxy is being used. In the case of the Spark job, no such connection is made.
A further indication that the cluster-side configuration is altered is that if a dedicated proxy server is set up in a way that both sides can see it, i.e. the local core-site.xml is changed to have
the Spark job (and the MR job) run fine, with all connections going through the dedicated proxy server. While this works, it's sub-optimal because it now requires that such a server be created, which may not always be possible because it requires privileged access to the gateway machine.
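For illustration, the dedicated-proxy variant of the local core-site.xml described above might look like the following sketch; proxy.example.com:1080 is a hypothetical address standing in for a proxy server that both the client and the cluster nodes can reach:

```xml
<!-- Sketch only: assumed client-side entry for a shared proxy server. -->
<configuration>
  <property>
    <name>hadoop.socks.server</name>
    <!-- Hypothetical host and port, visible from both sides of the firewall. -->
    <value>proxy.example.com:1080</value>
  </property>
</configuration>
```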
Therefore, it appears that Spark is perfectly happy running through a proxy in YARN mode, but that it overrides the cluster-side configuration even when properties are marked as <final>. Is this intended? If so, is there some other way to enforce that "final" properties are preserved?