[SPARK-29189] Add an option to ignore block locations when listing file - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels: None

Description

In our production environment we run a pure Spark cluster in which computation is separated from the storage layer; I think this setup is fairly common. In such a deployment, data locality is never achievable.
The Spark scheduler has configurations that reduce the time spent waiting for data locality (e.g. "spark.locality.wait"). The problem is that, during the file-listing phase, the location information of every file, including every block inside each file, is still fetched from the distributed file system. In a production environment a table can be so large that fetching all this location information alone takes tens of seconds.
To improve this scenario, Spark should provide an option to ignore data locality entirely: all we need in the file-listing phase is the file locations, without any block location information.

We ran a benchmark in our production environment. After ignoring block locations, listing time dropped in every case, by almost 10x in the best case.

Table Size	Total File Number	Total Block Number	List File Duration (With Block Location)	List File Duration (Without Block Location)
22.6 T	30000	120000	16.841 s	1.730 s
28.8 T	42001	148964	10.099 s	2.858 s
3.4 T	20000	20000	5.833 s	4.881 s
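
To make the benchmark easier to read, the following is a minimal Python sketch that derives two figures from the table above: the average number of blocks per file and the listing speedup from skipping block locations. The row values are copied from the table; the names `ROWS`, `summarize`, and `SUMMARY` are illustrative only and are not part of Spark.

```python
# Benchmark rows copied from the table above:
# (table size, files, blocks, seconds with block locations, seconds without)
ROWS = [
    ("22.6 T", 30000, 120000, 16.841, 1.730),
    ("28.8 T", 42001, 148964, 10.099, 2.858),
    ("3.4 T", 20000, 20000, 5.833, 4.881),
]


def summarize(rows):
    """Return (size, blocks_per_file, speedup) for each benchmark row."""
    out = []
    for size, files, blocks, with_loc, without_loc in rows:
        blocks_per_file = blocks / files
        speedup = with_loc / without_loc
        out.append((size, blocks_per_file, speedup))
    return out


SUMMARY = summarize(ROWS)

if __name__ == "__main__":
    for size, blocks_per_file, speedup in SUMMARY:
        print(f"{size:>7}  blocks/file={blocks_per_file:.2f}  speedup={speedup:.2f}x")
```

The speedups come out to about 9.7x, 3.5x, and 1.2x for tables averaging about 4.0, 3.5, and 1.0 blocks per file respectively. With only three data points this is suggestive rather than conclusive, but it is consistent with the gain being largest where each file carries many block locations.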

Issue Links

links to

GitHub Pull Request #25869

GitHub Pull Request #26054

GitHub Pull Request #26056

People

Assignee: Wang, Gang

Reporter: Wang, Gang

Votes: 0

Watchers: 4

Dates

Created: 20/Sep/19 12:50

Updated: 24/Jan/20 23:44

Resolved: 07/Oct/19 19:57