[SPARK-27801] InMemoryFileIndex.listLeafFiles should use listLocatedStatus for DistributedFileSystem - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.3
Fix Version/s: 3.0.0
Component/s: SQL
Labels:
None

Docs Text:
Improve performance of HDFS directory listings for directories with a large number of files

Description

Currently in InMemoryFileIndex, all directory listings are done using FileSystem.listStatus followed by individual calls to FileSystem.getFileBlockLocations. This is painfully slow for folders that contain large numbers of files, because the per-file calls happen serially and parallelism is applied only at the folder level, not the file level.

FileSystem also provides another API, listLocatedStatus, which returns LocatedFileStatus objects that already carry the block locations. In the base FileSystem class this simply delegates to listStatus and getFileBlockLocations, much as Spark does. When HDFS is the backing file system, however, DistributedFileSystem overrides the method and makes a single call to the namenode to retrieve the directory listing together with the block locations. This avoids potentially thousands of namenode calls and is also more consistent: each file is either returned with its locations or not returned at all, so the FileNotFoundException case that arises when a file disappears between the two calls goes away.
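To make the difference in call counts concrete, here is a minimal sketch that counts namenode round trips under the two strategies. It is a toy model, not the Hadoop API: `ToyNameNode` and its methods are hypothetical stand-ins that only borrow the names of the real calls, and the model treats the combined listing as a single call, as the description above does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not the Hadoop API) that counts namenode round trips per listing strategy. */
public class ListingCallCount {

    /** Hypothetical stand-in for a namenode: holds file names and counts RPCs. */
    static class ToyNameNode {
        private final List<String> files = new ArrayList<>();
        int rpcCount = 0;

        ToyNameNode(int fileCount) {
            for (int i = 0; i < fileCount; i++) {
                files.add("part-" + i + ".parquet");
            }
        }

        /** One RPC: file names only, modeled on listStatus. */
        List<String> listStatus() {
            rpcCount++;
            return new ArrayList<>(files);
        }

        /** One RPC per file, modeled on getFileBlockLocations. */
        String getFileBlockLocations(String file) {
            rpcCount++;
            return "blocks-of-" + file;
        }

        /** One RPC: names and block locations together, modeled on listLocatedStatus for HDFS. */
        List<String[]> listLocatedStatus() {
            rpcCount++;
            List<String[]> located = new ArrayList<>();
            for (String f : files) {
                located.add(new String[] {f, "blocks-of-" + f});
            }
            return located;
        }
    }

    /** Old strategy: list the directory, then look up each file's locations serially. */
    static int callsForListThenLocate(int fileCount) {
        ToyNameNode nn = new ToyNameNode(fileCount);
        for (String f : nn.listStatus()) {
            nn.getFileBlockLocations(f);
        }
        return nn.rpcCount;
    }

    /** New strategy: one combined listing call. */
    static int callsForListLocated(int fileCount) {
        ToyNameNode nn = new ToyNameNode(fileCount);
        nn.listLocatedStatus();
        return nn.rpcCount;
    }

    public static void main(String[] args) {
        System.out.println(callsForListThenLocate(6500)); // prints 6501
        System.out.println(callsForListLocated(6500));    // prints 1
    }
}
```

In this model the old path costs one call for the listing plus one per file, so the gap widens linearly with the size of the directory.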

For our example directory with 6500 files, the load time of spark.read.parquet was reduced 96x, from 76 seconds to 0.8 seconds. The savings grow with the number of files in the directory.

The pull request does not use listLocatedStatus unconditionally, since in the default FileSystem implementation that could surface a FileNotFoundException that is tough to decipher. Instead, the method is used only when the FileSystem is a DistributedFileSystem; otherwise the old logic still applies.
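The type-based choice described above can be sketched roughly as follows. This is a toy model, not the code from the pull request: `ToyFileSystem`, `ToyDistributedFileSystem`, and `listLeafFiles` are hypothetical stand-ins, and skipping a file that vanished between listing and lookup is an assumption about how the old logic treats that case.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Spark or Hadoop code) of choosing a listing path by file system type. */
public class ListingDispatch {

    /** Hypothetical generic file system: listing and location lookup are separate steps. */
    static class ToyFileSystem {
        final Map<String, String> blocksByFile = new LinkedHashMap<>();

        List<String> listStatus() {
            return new ArrayList<>(blocksByFile.keySet());
        }

        String getFileBlockLocations(String file) throws FileNotFoundException {
            String blocks = blocksByFile.get(file);
            if (blocks == null) {
                throw new FileNotFoundException(file);
            }
            return blocks;
        }
    }

    /** Hypothetical HDFS-like file system: adds a combined, all-or-nothing listing. */
    static class ToyDistributedFileSystem extends ToyFileSystem {
        Map<String, String> listLocatedStatus() {
            return new LinkedHashMap<>(blocksByFile);
        }
    }

    /** Uses the combined listing only for the HDFS-like type; otherwise keeps the old logic. */
    static Map<String, String> listLeafFiles(ToyFileSystem fs) {
        if (fs instanceof ToyDistributedFileSystem) {
            return ((ToyDistributedFileSystem) fs).listLocatedStatus();
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String file : fs.listStatus()) {
            try {
                result.put(file, fs.getFileBlockLocations(file));
            } catch (FileNotFoundException e) {
                // Assumption: a file deleted between the two steps is skipped.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyDistributedFileSystem dfs = new ToyDistributedFileSystem();
        dfs.blocksByFile.put("a.parquet", "host1");
        System.out.println(listLeafFiles(dfs).keySet());

        // Generic file system whose listing still names a file that was just deleted.
        ToyFileSystem stale = new ToyFileSystem() {
            @Override
            List<String> listStatus() {
                List<String> names = super.listStatus();
                names.add("deleted.parquet");
                return names;
            }
        };
        stale.blocksByFile.put("b.parquet", "host2");
        System.out.println(listLeafFiles(stale).keySet());
    }
}
```

Keeping the per-file lookup on the generic path means the missing-file case is handled at the point where it occurs, one file at a time, rather than failing the whole listing.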

Issue Links

is duplicated by

SPARK-27807 Parallel resolve leaf statuses in InMemoryFileIndex

Resolved

is related to

SPARK-31047 Improve file listing for ViewFileSystem

Resolved

links to

GitHub Pull Request #24672

People

Assignee: Rob Russo

Reporter: Rob Russo

Votes: 0

Watchers: 7

Dates

Created: 22/May/19 05:40

Updated: 26/Mar/20 22:18

Resolved: 25/May/19 22:51