[SPARK-21056] InMemoryFileIndex.listLeafFiles should create at most one spark job when listing files in parallel - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.1, 2.2.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

Given a partitioned file relation (e.g. Parquet) laid out as:

root/a=../b=../c=..

InMemoryFileIndex.listLeafFiles runs numberOfPartitions(a) × numberOfPartitions(b) Spark jobs sequentially to list leaf files when both numberOfPartitions(a) and numberOfPartitions(b) are below spark.sql.sources.parallelPartitionDiscovery.threshold and numberOfPartitions(c) is above it.

Because the jobs run sequentially, per-job overhead dominates, and the file listing can become significantly slower than listing the files from the driver.

I propose that InMemoryFileIndex.listLeafFiles launch at most one Spark job for listing leaf files.
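The job count described above can be illustrated with a small toy model. This is only a sketch, not Spark code: the function names, the `fanouts` list (number of subdirectories per directory at each level) and the strict "greater than threshold" comparison are assumptions made for illustration, and the breadth-first variant is just one possible way to satisfy the proposal, not necessarily what the linked pull request does.

```python
def count_listing_jobs(fanouts, threshold):
    """Toy model of the recursive listing described in this issue.

    fanouts[i] is the number of subdirectories under each directory at
    depth i (hypothetical input, e.g. [4, 5, 100] for a=4, b=5, c=100).
    Returns how many Spark jobs the model would launch.
    """
    def bulk_list(num_paths, depth):
        # Assumption: a sibling group larger than the threshold is handed
        # to a single Spark job that lists the whole subtree on executors.
        if num_paths > threshold:
            return 1
        # Otherwise the driver lists each path itself; leaves need no job.
        if depth == len(fanouts):
            return 0
        # Each path recurses independently into its own children, so the
        # job counts of the subtrees add up.
        return num_paths * bulk_list(fanouts[depth], depth + 1)

    return bulk_list(1, 0)


def count_jobs_breadth_first(fanouts, threshold):
    """One hypothetical way to launch at most one job: widen the frontier
    level by level on the driver, and hand the whole frontier to a single
    job the first time it exceeds the threshold."""
    frontier = 1
    for fanout in fanouts:
        frontier *= fanout
        if frontier > threshold:
            return 1
    return 0


if __name__ == "__main__":
    layout = [4, 5, 100]  # partitions of a, b, c
    print(count_listing_jobs(layout, threshold=32))
    print(count_jobs_breadth_first(layout, threshold=32))
```

With 4 values of a, 5 values of b, 100 values of c and a threshold of 32, the recursive model launches 4 × 5 = 20 jobs, while the breadth-first model launches one.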

Issue Links

links to

[Github] Pull Request #18269 (bbossy)

People

Assignee: Unassigned
Reporter: Bertrand Bossy
Votes: 6
Watchers: 8

Dates

Created: 11/Jun/17 11:22
Updated: 21/May/19 04:12
Resolved: 21/May/19 04:12