Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
2.0.0, 2.0.1, 2.1.0
None
Description
Hi, this is my first Spark issue submission, so please excuse any inconsistencies.
I am experiencing slow application startup when I specify many files in spark.yarn.jars by setting it to all JARs in an HDFS folder, such as hdfs://namenode/user/spark/lib/*.jar.
Since the JAR files are already on the same HDFS that YARN runs on, the application should start up very quickly. However, the delay is significant, especially when spark-submit runs from a non-local network, because spark-yarn accesses each JAR file individually via HDFS, adding hundreds of network round trips (RTTs) before the application is ready. The official Spark distribution with Hadoop 2.7 has more than 200 JARs, and more than 100 even if we exclude Hadoop and its dependencies.
There are currently two HDFS RPC calls for each file: one in ClientDistributedCacheManager.addResource, which calls fs.getFileStatus, and another in yarn.Client.copyFileToRemote, which calls fc.resolvePath. I suppose that both are unnecessary, since we have already retrieved all the FileStatuses and know that those files are not symlinks.
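To give a feel for the cost, here is a rough back-of-the-envelope estimate in Python. The 200 JARs and 2 RPCs per file come from the description above; the 50 ms round-trip time is only an assumed figure for a remote submission, and the model assumes the calls are made one after another.

```python
def startup_rpc_delay_ms(num_jars, rpcs_per_jar, rtt_ms):
    """Estimated time spent waiting on sequential per-file HDFS round trips."""
    return num_jars * rpcs_per_jar * rtt_ms


# 200 JARs and 2 RPCs per JAR are from the issue; 50 ms RTT is an assumption.
delay_ms = startup_rpc_delay_ms(num_jars=200, rpcs_per_jar=2, rtt_ms=50)
print(f"{delay_ms / 1000} s spent on per-file RPCs")
```

Under these assumptions the per-file calls alone would account for roughly 20 seconds of startup time, which is in line with the "hundreds of round trips" mentioned above.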
To fix this, I suppose that we can modify addResource to consult its statCache variable before making an HDFS RPC, and populate statCache appropriately before calling addResource. Also, an optional boolean parameter could be added to copyFileToRemote to skip the symlink check.
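The idea can be sketched with a small Python model. This is not Spark or Hadoop code: FakeFileSystem, distribute, and the check_symlink flag are hypothetical names used only for illustration, and the fake file system simply counts round trips so the two approaches can be compared.

```python
from dataclasses import dataclass


@dataclass
class FakeFileSystem:
    """Stand-in for HDFS that counts round trips; not a Hadoop API."""
    files: dict          # path -> file status (here just a size in bytes)
    rpc_count: int = 0

    def list_status(self, directory):
        # One RPC returns the status of every file in the directory.
        self.rpc_count += 1
        prefix = directory + "/"
        return {p: s for p, s in self.files.items() if p.startswith(prefix)}

    def get_file_status(self, path):
        self.rpc_count += 1
        return self.files[path]

    def resolve_path(self, path):
        self.rpc_count += 1
        return path


def add_resource(fs, path, stat_cache):
    # Consult the cache first; fall back to an RPC only on a miss.
    if path not in stat_cache:
        stat_cache[path] = fs.get_file_status(path)
    return stat_cache[path]


def copy_file_to_remote(fs, path, check_symlink=True):
    # Hypothetical flag: skip resolution when the caller knows it is not a symlink.
    return fs.resolve_path(path) if check_symlink else path


def distribute(fs, directory, optimized):
    """Register every file in the directory and return the RPCs issued."""
    stat_cache = {}
    listing = fs.list_status(directory)
    if optimized:
        # Pre-populate the cache from the listing we already have.
        stat_cache.update(listing)
    for path in sorted(listing):
        copy_file_to_remote(fs, path, check_symlink=not optimized)
        add_resource(fs, path, stat_cache)
    return fs.rpc_count


jars = {f"/lib/{i}.jar": 1 for i in range(200)}
print(distribute(FakeFileSystem(dict(jars)), "/lib", optimized=False))
print(distribute(FakeFileSystem(dict(jars)), "/lib", optimized=True))
```

In this model the unoptimized path issues one listing call plus two calls per JAR, while the optimized path needs only the listing call. The real change would of course have to handle cases the model ignores, such as resources that are not part of the listing.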