[SPARK-21917] Remote http(s) resources is not supported in YARN mode - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: Spark Core, Spark Submit, YARN
Labels:
None

Description

In the current Spark, when submitting application on YARN with remote resources ./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v, Spark will be failed with:

java.io.IOException: No FileSystem for scheme: http
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
	at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)

This is because YARN#client assumes resources must be on the Hadoop compatible FS, also in the NM (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) it will only use Hadoop compatible FS to download resources. So this makes Spark on YARN fail to support remote http(s) resources.

To solve this problem, there might be several options:

Download remote http(s) resources to local and add this local downloaded resources to dist cache. The downside of this option is that remote resources will be uploaded again unnecessarily.

Filter remote http(s) resources and add them with spark.jars or spark.files, to leverage Spark's internal fileserver to distribute remote http(s) resources. The problem of this solution is: for some resources which require to be available before application start may not work.

Leverage Hadoop's support http(s) file system (https://issues.apache.org/jira/browse/HADOOP-14383). This is only worked in Hadoop 2.9+, and I think even we implement a similar one in Spark will not be worked.

Attachments

Issue Links

links to

[Github] Pull Request #19130 (jerryshao)

[Github] Pull Request #21633 (jerryshao)

Activity

People

Assignee:: Saisai Shao

Reporter:: Saisai Shao

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 05/Sep/17 05:59

Updated:: 17/May/20 18:15

Resolved:: 19/Sep/17 14:27