[SPARK-29089] DataFrameReader bottleneck in DataSource#checkAndGlobPathIfNecessary when reading large amount of S3 files - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.1.0
Component/s: SQL
Labels:
- pull-request-available

Description

When using DataFrameReader#csv to read many S3 files (300k in my case), I noticed that it took about an hour for the files to be loaded on the driver.

The gap is visible in the log timestamps below: driver startup finishes at 07:45:40, but the InMemoryFileIndex message does not appear until 08:54:57.

19/09/06 07:44:42 INFO SparkContext: Running Spark version 2.4.4
19/09/06 07:44:42 INFO SparkContext: Submitted application: LoglineParquetGenerator
...
19/09/06 07:45:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/09/06 08:54:57 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: [300K files...]

A major source of the bottleneck is DataSource#checkAndGlobPathIfNecessary, which (possibly) globs the paths and calls FileSystem#exists on all of them in a single thread. On S3, each of these is a slow network call.

After a discussion on the mailing list [0], it was suggested that an improvement could be to:

- have SparkHadoopUtil differentiate between files returned by globStatus(), which therefore exist, and paths it did not glob for; only the latter need an existence check
- add parallel execution to the glob and existence checks
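
The two suggestions above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the code from the actual patch: `PathStore` is a hypothetical stand-in for the `globStatus` and `exists` calls on Hadoop's `FileSystem`, and `ParallelPathCheck`, `resolve`, `isGlob`, and `numThreads` are names invented for this example. The sketch assumes a path counts as a glob when it contains one of the characters `{}[]*?\`.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPathCheck {

    /** Hypothetical stand-in for the two file system calls discussed in the issue. */
    interface PathStore {
        List<String> globStatus(String pattern) throws Exception; // results are known to exist
        boolean exists(String path) throws Exception;             // one network round trip on S3
    }

    /** Assumption for this sketch: these characters mark a path as a glob pattern. */
    static boolean isGlob(String path) {
        for (char c : path.toCharArray()) {
            if ("{}[]*?\\".indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Resolves every input path on a thread pool instead of one at a time. */
    static List<String> resolve(PathStore store, List<String> paths, int numThreads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String p : paths) {
                Callable<List<String>> task = () -> {
                    if (isGlob(p)) {
                        // Globbed results exist by construction: no exists() call needed.
                        return store.globStatus(p);
                    }
                    // Only paths that were not globbed need an explicit existence check.
                    if (!store.exists(p)) {
                        throw new FileNotFoundException("Path does not exist: " + p);
                    }
                    return Collections.singletonList(p);
                };
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the output order matches the input order.
            List<String> resolved = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                resolved.addAll(f.get()); // a failed task surfaces as ExecutionException
            }
            return resolved;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 300k literal paths, the sequential version pays one network round trip after another, while this version keeps up to `numThreads` of them in flight at once. Handling of empty glob results is deliberately left out of the sketch.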

I am currently working on a patch that implements this improvement.

[0] http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html

Issue Links

links to

GitHub Pull Request #25899

People

Assignee: Arwin S Tio
Reporter: Arwin S Tio
Votes: 1
Watchers: 2

Dates

Created: 15/Sep/19 10:48
Updated: 12/Apr/24 10:19
Resolved: 17/Feb/20 15:31