[HUDI-4812] Lazy partition listing and file groups fetching in Spark Query - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Done
Affects Version/s: None
Fix Version/s: 0.13.0
Component/s: spark
Labels: pull-request-available
Story Points: 20
Epic Link: Hudi Spark Datasource

Description

In the current Spark query implementation, the FileIndex refreshes and loads all file groups into its cache in order to serve subsequent queries.

For a large table with many partitions, this can add significant overhead during initialization. Moreover, the query itself may come with a partition filter, in which case loading the file groups of every partition is unnecessary.

To optimize this, the whole refresh logic will become lazy: the actual listing work is carried out only after the partition filter has been applied.
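
The idea can be illustrated with a minimal sketch. This is illustrative Python, not Hudi's code: `LazyFileIndex`, `list_partitions`, and `list_file_groups` are hypothetical names, and the two listing callbacks stand in for whatever storage or metadata listing the real FileIndex performs. The constructor does no listing at all; file groups are fetched only for partitions that pass the filter, and are cached for subsequent queries.

```python
from typing import Callable, Dict, List, Optional


class LazyFileIndex:
    """Toy file index that lists partitions and file groups on demand.

    Hypothetical illustration only; it does not mirror Hudi's actual classes.
    """

    def __init__(
        self,
        list_partitions: Callable[[], List[str]],
        list_file_groups: Callable[[str], List[str]],
    ) -> None:
        # Nothing is listed here, so initialization stays cheap.
        self._list_partitions = list_partitions
        self._list_file_groups = list_file_groups
        self._partitions: Optional[List[str]] = None
        self._file_groups: Dict[str, List[str]] = {}

    def _all_partitions(self) -> List[str]:
        # Partition paths are listed once, on the first query.
        if self._partitions is None:
            self._partitions = self._list_partitions()
        return self._partitions

    def files_for(self, partition_filter: Callable[[str], bool]) -> Dict[str, List[str]]:
        # Apply the partition filter first, then fetch file groups
        # only for the partitions that survived it.
        result: Dict[str, List[str]] = {}
        for partition in self._all_partitions():
            if not partition_filter(partition):
                continue
            if partition not in self._file_groups:
                self._file_groups[partition] = self._list_file_groups(partition)
            result[partition] = self._file_groups[partition]
        return result


# Fake storage layer that records which partitions were actually listed.
listed: List[str] = []


def list_partitions() -> List[str]:
    return ["2022/09/01", "2022/09/02", "2022/09/03"]


def list_file_groups(partition: str) -> List[str]:
    listed.append(partition)
    return [f"{partition}/fg-0", f"{partition}/fg-1"]


index = LazyFileIndex(list_partitions, list_file_groups)
result = index.files_for(lambda p: p == "2022/09/02")
print(listed)  # ['2022/09/02']
```

In this sketch, a query filtered to a single partition triggers one file-group listing instead of three, and repeating the same query reuses the cached entry. An eager index would have listed all three partitions in the constructor regardless of the filter.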

Issue Links

fixes

HUDI-3717 Avoid double-listing w/in BaseHoodieTableFileIndex

Closed

links to

GitHub Pull Request #6680

GitHub Pull Request #7233

People

Assignee: Yuwei Xiao

Reporter: Yuwei Xiao

Reviewers: Alexey Kudinkin

Votes: 0

Watchers: 2

Dates

Created: 08/Sep/22 07:27

Updated: 30/Nov/22 23:54

Resolved: 18/Nov/22 02:20