Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-2669

Hadoop configuration is not localised when submitting job in yarn-cluster mode

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.4.0
    • Component/s: YARN
    • Labels:
      None
    • Target Version/s:

      Description

      I'd like to propose a fix for a problem when Hadoop configuration is not localized when job is submitted in yarn-cluster mode. Here is a description from github pull request https://github.com/apache/spark/pull/1574

      This patch fixes a problem when Spark driver is run in the container
      managed by YARN ResourceManager it inherits configuration from a
      NodeManager process, which can be different from the Hadoop
      configuration present on the client (submitting machine). Problem is
      most vivid when fs.defaultFS property differs between these two.

      Hadoop MR solves it by serializing client's Hadoop configuration into
      job.xml in application staging directory and then making Application
      Master to use it. That guarantees that regardless of execution nodes
      configurations all application containers use same config identical to
      one on the client side.

      This patch uses similar approach. YARN ClientBase serializes
      configuration and adds it to ClientDistributedCacheManager under
      "job.xml" link name. ClientDistributedCacheManager is then utilizes
      Hadoop localizer to deliver it to whatever container is started by this
      application, including the one running Spark driver.

      YARN ClientBase also adds "SPARK_LOCAL_HADOOPCONF" env variable to AM
      container request which is then used by SparkHadoopUtil.newConfiguration
      to trigger new behavior when machine-wide hadoop configuration is merged
      with application specific job.xml (exactly how it is done in Hadoop MR).

      SparkContext is then follows same approach, adding
      SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use
      client-side Hadopo configuration.

      Also all the references to "new Configuration()" which might be executed
      on YARN cluster side are changed to use SparkHadoopUtil.get.conf

      Please note that it fixes only core Spark, the part which I am
      comfortable to test and verify the result. I didn't descend into
      steaming/shark directories, so things might need to be changed there too.

        Attachments

          Activity

            People

            • Assignee:
              vanzin Marcelo Vanzin
              Reporter:
              rossmohax Maxim Ivanov
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: