In current Spark, submitting an application on YARN with remote resources, e.g. `./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v`, fails.
This is because YARN's client assumes resources live on a Hadoop-compatible file system, and likewise the NodeManager (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) only uses a Hadoop-compatible file system to download resources. As a result, Spark on YARN cannot support remote http(s) resources.
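Concretely, the distinction comes down to the URI scheme: schemes like `hdfs` or `file` resolve to a Hadoop-compatible file system, while `http`/`https` do not (before HADOOP-14383). A minimal sketch of such a scheme check — the class and method names here are hypothetical, not Spark's actual API:

```java
import java.net.URI;

// Hypothetical helper illustrating the distinction: before HADOOP-14383
// there is no Hadoop FileSystem implementation for http/https, so such
// URIs cannot be localized by the YARN NodeManager.
public class ResourceSchemes {
    public static boolean isRemoteHttpResource(String uri) {
        String scheme = URI.create(uri).getScheme();
        if (scheme == null) {
            scheme = "file"; // no scheme: treat as a local path
        }
        return scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
    }
}
```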
To solve this problem, there are several options:
- Download remote http(s) resources to the local disk and add the downloaded files to the distributed cache. The downside of this option is that the resources will then be uploaded to the cluster again unnecessarily.
- Filter out remote http(s) resources and add them via spark.jars or spark.files, leveraging Spark's internal file server to distribute them. The problem with this solution: resources that must be available before the application starts may not work.
- Leverage Hadoop's http(s) file system support (https://issues.apache.org/jira/browse/HADOOP-14383). This only works on Hadoop 2.9+, and I suspect even implementing a similar file system in Spark would not work, since the NodeManager would also need that implementation on its classpath to localize resources.
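The first option could be sketched roughly as below. This is only an illustration under the assumption that the downloaded file is later registered in the distributed cache like any other local resource; the class and method names are hypothetical, not Spark's actual implementation:

```java
import java.io.File;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of option 1: fetch an http(s) resource into a local
// directory so it can then be added to the YARN distributed cache.
public class HttpResourceDownloader {
    public static File downloadToLocal(String uri, File targetDir) throws Exception {
        String path = URI.create(uri).getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        File target = new File(targetDir, fileName);
        // Stream the remote content to the local target file.
        try (InputStream in = URI.create(uri).toURL().openStream()) {
            Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```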