[HUDI-1469] Faster initialization for larger datasets - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.7.0
Component/s: None
Labels:
- pull-request-available

Epic Link: Metadata Table for File Listing & Query Planning

Description

For very large tables (200+ partitions and 100K+ files), the current initialization code in HoodieBackedTableMetadataWriter is slow because it lists all partitions and files sequentially.

The code is also inefficient because it lists each directory twice: first to get the list of partitions and later to get the list of files. Both lists can be built from a single listing of each directory.
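
The idea can be illustrated with a minimal sketch. It is not the code from the linked pull request: `SinglePassListing`, `DirListing`, `listOnce` and `listPartitions` are hypothetical names, the local filesystem (`java.nio.file`) stands in for the table's storage, and a directory is assumed to be a partition whenever it directly contains at least one file. Each directory is listed exactly once, and all directories at the same depth are listed in parallel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not Hudi code: walks a table directory level by level,
// listing each directory exactly once and all directories of a level in parallel.
final class SinglePassListing {

    // Result of listing one directory: its files and its sub-directories.
    static final class DirListing {
        final Path dir;
        final List<Path> files = new ArrayList<>();
        final List<Path> subDirs = new ArrayList<>();

        DirListing(Path dir) {
            this.dir = dir;
        }
    }

    // Lists a single directory once, splitting entries into files and sub-directories.
    static DirListing listOnce(Path dir) {
        DirListing result = new DirListing(dir);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    result.subDirs.add(entry);
                } else {
                    result.files.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Returns a map from relative partition path to the sorted file names in it.
    // Assumption: a directory is a partition if it directly contains at least one file.
    static Map<String, List<String>> listPartitions(Path basePath) {
        Map<String, List<String>> partitionToFiles = new TreeMap<>();
        List<Path> pending = new ArrayList<>();
        pending.add(basePath);
        while (!pending.isEmpty()) {
            // List every directory of the current level concurrently.
            List<DirListing> listings = pending.parallelStream()
                    .map(SinglePassListing::listOnce)
                    .collect(Collectors.toList());
            pending = new ArrayList<>();
            for (DirListing listing : listings) {
                if (!listing.files.isEmpty()) {
                    String partition = basePath.relativize(listing.dir).toString();
                    List<String> names = listing.files.stream()
                            .map(p -> p.getFileName().toString())
                            .sorted()
                            .collect(Collectors.toList());
                    partitionToFiles.put(partition, names);
                }
                // Sub-directories found in the same listing feed the next level.
                pending.addAll(listing.subDirs);
            }
        }
        return partitionToFiles;
    }
}
```

Calling `SinglePassListing.listPartitions(basePath)` on a table's base path yields both the partition list (the map keys) and the file list per partition (the map values) from the same set of directory listings. The sketch does not filter out any internal metadata directories, which a real implementation would need to do.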

Issue Links

links to

GitHub Pull Request #2343

People

Assignee: Prashant Wason
Reporter: Prashant Wason
Votes: 0
Watchers: 1

Dates

Created: 17/Dec/20 22:27
Updated: 04/Jan/22 00:09
Resolved: 29/Dec/20 22:40

Time Tracking

Estimated, Remaining, Logged: Not Specified