Spark / SPARK-21917

Remote http(s) resources is not supported in YARN mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: Spark Submit, YARN
    • Labels:
      None

      Description

      In current Spark, when submitting an application on YARN with a remote resource, e.g. ./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v, Spark fails with:

      java.io.IOException: No FileSystem for scheme: http
      	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
      	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
      	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
      	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
      	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
      	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
      	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
      	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
      	at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
      	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
      	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
      	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
      

      This is because YARN's Client assumes resources reside on a Hadoop-compatible filesystem; likewise, the NodeManager (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) only uses a Hadoop-compatible filesystem to download resources. As a result, Spark on YARN cannot support remote http(s) resources.
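
      The failure at the top of the stack trace comes from this scheme-to-implementation lookup. A minimal sketch of that lookup (the registry contents and class names here are illustrative stand-ins, not Hadoop's actual code, which builds its registry from fs.&lt;scheme&gt;.impl configuration keys and ServiceLoader entries):

```java
import java.io.IOException;
import java.util.Map;

public class SchemeLookup {
    // Illustrative stand-in for Hadoop's FileSystem registry. Note there is
    // no entry for "http" or "https" -- Hadoop ships no FileSystem
    // implementation for those schemes.
    static final Map<String, String> REGISTRY = Map.of(
        "hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem",
        "file", "org.apache.hadoop.fs.LocalFileSystem",
        "ftp",  "org.apache.hadoop.fs.ftp.FTPFileSystem");

    static String getFileSystemClass(String scheme) throws IOException {
        String impl = REGISTRY.get(scheme);
        if (impl == null) {
            // Mirrors the message in the stack trace above.
            throw new IOException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(getFileSystemClass("hdfs"));   // resolves
        try {
            getFileSystemClass("http");                   // reproduces the failure
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```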

      To solve this problem, there might be several options:

      • Download remote http(s) resources to the local filesystem and add the downloaded copies to the distributed cache. The downside of this option is that the remote resources are then uploaded again unnecessarily.
      • Filter out remote http(s) resources and add them via spark.jars or spark.files, leveraging Spark's internal file server to distribute them. The problem with this solution is that resources which must be available before the application starts may not work.
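
      The first option can be sketched as below. The helper name and the use of java.net.URI/URL streaming are assumptions for illustration, not the eventual Spark implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RemoteResourceFetcher {
    // Hypothetical helper for the first option: fetch a remote http(s)
    // resource into a local staging directory, so the local copy can then be
    // handed to the distributed cache like any ordinary local file. The
    // remote bytes are transferred twice (downloaded here, then uploaded to
    // the cluster FS), which is the downside noted above.
    static Path downloadToLocal(URI remote, Path targetDir) throws IOException {
        String name = Path.of(remote.getPath()).getFileName().toString();
        Path local = targetDir.resolve(name);
        try (InputStream in = remote.toURL().openStream()) {
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        }
        return local;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrated with a file:// URI to stay self-contained; an http(s)
        // URI goes through the same URL.openStream() path.
        Path src = Files.createTempFile("resource-", ".jar");
        Files.writeString(src, "jar bytes");
        Path staging = Files.createTempDirectory("dist-cache-staging");
        Path local = downloadToLocal(src.toUri(), staging);
        System.out.println(Files.readString(local)); // same bytes as the source
    }
}
```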


      People

      • Assignee: Saisai Shao (jerryshao)
      • Reporter: Saisai Shao (jerryshao)
      • Votes: 0
      • Watchers: 5
