[SPARK-9825] Spark overwrites remote cluster "final" properties with local config

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: YARN
    • Labels: None

    Description

      Configuration options specified in the Hadoop cluster's *.xml config files can be marked as "final", indicating that they should not be overridden by a client's configuration. Spark appears to be overwriting those options; the symptom is that local proxy settings overwrite the cluster-side proxy settings, which breaks things when trying to run jobs on a remote, firewalled YARN cluster.
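
      For context, this is roughly how Hadoop's own Configuration class treats <final> when resources are layered on top of each other. The Scala snippet below is only a minimal sketch; the file paths are illustrative placeholders, not paths from this issue.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path

      // Minimal sketch of Hadoop's <final> semantics: a key marked
      // <final>true</final> in an earlier resource is not replaced when a later
      // resource tries to set it; Hadoop logs "an attempt to override final
      // parameter: ...; Ignoring." instead (see the MapReduce log below).
      val conf = new Configuration(false)
      conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))  // cluster config, marks the key final
      conf.addResource(new Path("/tmp/client-core-site.xml"))       // client config, tries to override it
      // Expect the cluster-side StandardSocketFactory value to survive.
      println(conf.get("hadoop.rpc.socket.factory.class.default"))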

      For example, with the configuration below, one should be able to establish a SOCKS proxy via ssh -D to a host that can "see" the cluster, and then submit jobs and run the driver on the local desktop/laptop:

      Remote cluster-side core-site.xml:

      <property>
          <name>hadoop.rpc.socket.factory.class.default</name>
          <value>org.apache.hadoop.net.StandardSocketFactory</value>
          <final>true</final>
      </property>
      

      This configuration ensures that the nodes within the cluster never use a proxy to talk to each other.

      Local client-side core-site.xml:

      <property>
          <name>hadoop.rpc.socket.factory.class.default</name>
          <value>org.apache.hadoop.net.SocksSocketFactory</value>
      </property>
      
      <property>
          <name>hadoop.socks.server</name>
          <value>localhost:9999</value>
      </property>
      

      Indeed, when running a standard MapReduce job, the log files show that an attempt to override a property marked <final> is detected and ignored:

      2015-07-27 15:26:11,706 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: hadoop.rpc.socket.factory.class.default;  Ignoring.
      

      and the MR job proceeds and finishes normally.

      On the other hand, a Spark job with the same configuration shows no such message; instead, we see that the nodes within the cluster are unable to communicate:

      15/07/27 15:25:43 INFO client.RMProxy: Connecting to ResourceManager at node1/10.211.55.101:8030
      15/07/27 15:25:43 INFO yarn.YarnRMClient: Registering the ApplicationMaster
      15/07/27 15:25:44 INFO ipc.Client: Retrying connect to server: node1/10.211.55.101:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      

      Running tcpdump on the slave nodes shows that, in the case of the MR job, packets are sent directly between the slave nodes and the ResourceManager node (indicating that no proxy is being used), while in the case of the Spark job no such connection is made.
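
      Another way to check which configuration the YARN containers actually end up with is to run a trivial job that reports the resolved socket factory from each executor. This is only a hedged diagnostic sketch in Scala and is not part of the original report; the application name is arbitrary.

      import org.apache.hadoop.conf.Configuration
      import org.apache.spark.{SparkConf, SparkContext}

      // Diagnostic sketch: print the socket factory each container resolves, to
      // see whether the cluster-side <final> value survived the config merge.
      val sc = new SparkContext(new SparkConf().setAppName("socket-factory-check"))
      val factories = sc.parallelize(1 to 4, 4).map { _ =>
        val c = new Configuration()  // loads whatever *-site.xml the container sees on its classpath
        c.get("hadoop.rpc.socket.factory.class.default", "<unset>")
      }.collect()
      factories.distinct.foreach(println)
      sc.stop()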

      A further indication that the cluster-side configuration is being altered: if a dedicated proxy server is set up so that both sides can see it, i.e. the local core-site.xml is changed to have

      <property>
          <name>hadoop.socks.server</name>
          <value>node2:9999</value>
      </property>
      

      the Spark job (and the MR job) runs fine, with all connections going through the dedicated proxy server. While this works, it is sub-optimal: it requires setting up such a server, which may not always be possible since it needs privileged access to the gateway machine.

      Therefore, it appears that Spark is perfectly happy running through a proxy in YARN mode, but that it overrides the cluster-side configuration even when properties are marked as <final>. Is this intended? Or is there some other way to enforce that the "final" properties are preserved?
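
      One conceivable client-side guard, given purely as a hedged Scala sketch and not necessarily how Spark resolves this issue, would be to consult Hadoop's Configuration#getFinalParameters (assumed to be available in the Hadoop release in use) and skip any key the loaded resources have marked <final> before copying local overrides into the configuration that gets shipped to the cluster:

      import org.apache.hadoop.conf.Configuration

      // Hedged sketch: merge overrides into a Configuration while skipping any
      // key the loaded resources mark <final>, mimicking the warning MapReduce
      // prints above. mergeRespectingFinal is a hypothetical helper, not a
      // Spark API.
      def mergeRespectingFinal(target: Configuration, overrides: Map[String, String]): Unit = {
        val finalKeys = target.getFinalParameters  // java.util.Set[String] of keys marked <final>
        for ((key, value) <- overrides) {
          if (finalKeys.contains(key)) {
            System.err.println(s"Ignoring attempt to override final parameter: $key")
          } else {
            target.set(key, value)
          }
        }
      }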

            People

              Assignee: Marcelo Masiero Vanzin (vanzin)
              Reporter: Rok Roskar (rrrrrok)
              Votes: 0
              Watchers: 3
