[SPARK-19501] Slow checking if there are many spark.yarn.jars, which are already on HDFS - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0, 2.0.1, 2.1.0
Fix Version/s: 2.0.3, 2.1.1, 2.2.0
Component/s: Spark Core, YARN
Labels: None

Description

Hi, this is my first Spark issue submission, so please excuse any inconsistencies.

I am experiencing slow application startup when I specify many files in spark.yarn.jars, for example by setting it to all JARs in an HDFS folder, such as hdfs://namenode/user/spark/lib/*.jar.

Since the JAR files are already on the same HDFS that YARN is running on, the application should start up very quickly. However, the delay is significant, especially when spark-submit runs from a non-local network, because spark-yarn accesses each JAR file individually via HDFS, adding hundreds of round trips (RTTs) before the application is ready. The official Spark distribution with Hadoop 2.7 has more than 200 JARs, and more than 100 even if we exclude Hadoop and its dependencies.

There are currently two HDFS RPC calls for each file: one in ClientDistributedCacheManager.addResource, which calls fs.getFileStatus, and another in yarn.Client.copyFileToRemote, which calls fc.resolvePath. I suppose that both are unnecessary, since we have already retrieved the FileStatus of every file and these files are not symlinks.
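To give a rough sense of scale, the per-file RPC cost can be estimated with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes the calls are issued one after another, and the 50 ms round-trip time is a hypothetical figure, not a measurement from this issue. The function name is made up for illustration.

```python
def startup_rpc_delay_seconds(num_jars: int, rpcs_per_jar: int, rtt_ms: float) -> float:
    """Estimate the delay added by per-file RPCs, assuming they run sequentially."""
    return num_jars * rpcs_per_jar * rtt_ms / 1000.0


# 200 JARs, 2 RPCs per JAR, and an assumed (hypothetical) 50 ms round trip.
delay = startup_rpc_delay_seconds(num_jars=200, rpcs_per_jar=2, rtt_ms=50)
print(f"{delay:.1f} s spent waiting on per-file RPCs")
```

Under these assumptions the wait scales linearly with both the number of JARs and the RTT, which is why the effect is most visible when submitting from a distant network.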

To fix this, I suppose that we can modify addResource to consult its statCache variable before making an HDFS RPC, and populate statCache appropriately before calling addResource. In addition, an optional boolean parameter could be added to copyFileToRemote to skip the symlink check.
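The statCache idea can be illustrated with a small, self-contained model. This is not Spark's code and not the change in the linked pull request: `CountingFileSystem`, `FileStatus`, and `add_resource` are hypothetical stand-ins written in Python, and the directory listing stands in for whatever call already retrieved the file statuses. The sketch only shows that pre-populating a cache from one listing avoids the per-file status lookup.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical, simplified stand-in for a file status record."""
    path: str
    length: int


class CountingFileSystem:
    """Hypothetical remote file system that counts simulated RPCs."""

    def __init__(self, files: Dict[str, int]):
        self._files = files
        self.rpc_count = 0

    def list_status(self, directory: str) -> List[FileStatus]:
        # One RPC returns the status of every file under the directory.
        self.rpc_count += 1
        prefix = directory.rstrip("/") + "/"
        return [FileStatus(p, n) for p, n in sorted(self._files.items())
                if p.startswith(prefix)]

    def get_file_status(self, path: str) -> FileStatus:
        # One RPC per file: the call the proposal tries to avoid.
        self.rpc_count += 1
        return FileStatus(path, self._files[path])


def add_resource(fs: CountingFileSystem, path: str,
                 stat_cache: Dict[str, FileStatus]) -> FileStatus:
    """Look in the cache first; fall back to an RPC only on a miss."""
    status = stat_cache.get(path)
    if status is None:
        status = fs.get_file_status(path)
        stat_cache[path] = status
    return status


files = {f"/user/spark/lib/dep{i}.jar": 1024 for i in range(200)}

# Without pre-population: one status RPC per JAR.
cold_fs = CountingFileSystem(files)
cold_cache: Dict[str, FileStatus] = {}
for p in files:
    add_resource(cold_fs, p, cold_cache)

# With pre-population: the single listing fills the cache up front.
warm_fs = CountingFileSystem(files)
warm_cache = {s.path: s for s in warm_fs.list_status("/user/spark/lib")}
for p in files:
    add_resource(warm_fs, p, warm_cache)

print(cold_fs.rpc_count, warm_fs.rpc_count)
```

In this model the cold run issues one simulated RPC per file, while the warm run issues only the initial listing. The optional flag to skip the symlink check would remove the second per-file call in the same way, but it is not modeled here.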

Issue Links

links to

[Github] Pull Request #16916 (jongwook)

People

Assignee: Jong Wook Kim

Reporter: Jong Wook Kim

Votes: 0

Watchers: 3

Dates

Created: 07/Feb/17 20:19

Updated: 17/May/20 18:13

Resolved: 14/Feb/17 19:57