[HUDI-6712] Implement optimized keyed lookup on parquet files - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: None
Labels:
- pull-request-available

Epic Link: 1.X Format Changes

Description

Parquet performs poorly when looking up specific records by the values of a single key column.

e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)

Let's implement a reader that is optimized for this pattern and scans the least amount of data.
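
As a rough illustration of what "scanning the least amount of data" could mean, the sketch below prunes pages using per-page min/max key statistics, similar in spirit to Parquet's column index. It is not the reader proposed in this issue and does not use the Parquet API: `Page`, `Result`, `firstCandidatePage` and `samplePages` are hypothetical names, pages are modeled as in-memory lists, and the file is assumed to be sorted by key. Bloom filters are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class KeyedLookupSketch {

    /** Hypothetical stand-in for one data page: min/max key statistics plus rows as {key, value}. */
    static final class Page {
        final String minKey;
        final String maxKey;
        final List<String[]> rows;

        Page(List<String[]> rows) {
            this.rows = rows;
            this.minKey = rows.get(0)[0];               // rows are assumed sorted by key
            this.maxKey = rows.get(rows.size() - 1)[0];
        }
    }

    /** Matching rows plus the indexes of the pages that had to be read. */
    static final class Result {
        final List<String[]> rows = new ArrayList<>();
        final TreeSet<Integer> pagesRead = new TreeSet<>();
    }

    /** Binary search for the first page whose maxKey >= key; returns pages.size() if there is none. */
    static int firstCandidatePage(List<Page> pages, String key) {
        int lo = 0, hi = pages.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (pages.get(mid).maxKey.compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    static Result lookup(List<Page> pages, TreeSet<String> keys) {
        Result result = new Result();
        for (String key : keys) {
            int p = firstCandidatePage(pages, key);
            // Duplicates of a key may spill into following pages, so keep going
            // while the page's minKey does not exceed the key.
            while (p < pages.size() && pages.get(p).minKey.compareTo(key) <= 0) {
                result.pagesRead.add(p);
                for (String[] row : pages.get(p).rows) {
                    if (row[0].equals(key)) result.rows.add(row);
                }
                p++;
            }
            // A key that matches no row simply contributes nothing.
        }
        return result;
    }

    /** Three sorted pages; key "b" spans the first two. */
    static List<Page> samplePages() {
        return Arrays.asList(
            new Page(Arrays.asList(new String[]{"a", "1"}, new String[]{"b", "2"})),
            new Page(Arrays.asList(new String[]{"b", "3"}, new String[]{"d", "4"})),
            new Page(Arrays.asList(new String[]{"e", "5"}, new String[]{"f", "6"})));
    }

    public static void main(String[] args) {
        Result r = lookup(samplePages(), new TreeSet<>(Arrays.asList("b", "c", "z")));
        for (String[] row : r.rows) System.out.println(row[0] + "=" + row[1]);
        System.out.println("pages read: " + r.pagesRead);
    }
}
```

With the sample data, looking up "b", "c" and "z" prints `b=2`, `b=3` and `pages read: [0, 1]`: the third page is never touched. The missing key "c" costs no extra page here only because the page that could contain it was already read for "b".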

Requirements:
1. Need to support multiple values for the same key.
2. Can assume the file is sorted by the key/lookup field.
3. Should handle non-existence of keys.
4. Should leverage Parquet metadata (bloom filters, column index, ...) to minimize the data read.
5. Must make the minimum number of RPC calls to cloud storage.
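
One possible way to approach the last requirement, shown here only as a sketch, is to merge the byte ranges of the pages selected for reading whenever they lie close together, so that each merged range becomes a single request to cloud storage. The class `RangeCoalescer`, the `{offset, length}` array encoding and the `maxGap` threshold are assumptions made for illustration; none of them comes from this issue or from the Parquet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RangeCoalescer {

    /**
     * Merges byte ranges, each given as {offset, length}, whenever the gap between
     * two neighbours is at most maxGap bytes. Each returned range stands for one read request.
     */
    static List<long[]> coalesce(List<long[]> ranges, long maxGap) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long start = r[0];
            long end = r[0] + r[1];
            if (!merged.isEmpty()) {
                long[] last = merged.get(merged.size() - 1);
                long lastEnd = last[0] + last[1];
                if (start - lastEnd <= maxGap) {
                    // Close enough (or overlapping): extend the previous request.
                    last[1] = Math.max(lastEnd, end) - last[0];
                    continue;
                }
            }
            merged.add(new long[]{start, r[1]});   // copy, so inputs are never mutated
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> pages = Arrays.asList(
            new long[]{0, 100}, new long[]{120, 80}, new long[]{5000, 100});
        for (long[] r : coalesce(pages, 64)) System.out.println(r[0] + "+" + r[1]);
    }
}
```

With a 64-byte gap threshold, the three sample ranges collapse into two requests, printed as `0+200` and `5000+100`. The trade-off is that a larger `maxGap` saves round trips but reads more bytes that are then discarded.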

Issue Links

Blocked
- HUDI-6772 Handle missing index metadata for keyed lookup reader (Open)
- HUDI-6770 Improve on Key Lookup Reader (Closed)
- HUDI-6771 Support Bloom Filter in Keyed Lookup Reader (Patch Available)

blocks
- HUDI-6769 Integration test on Keyed Lookup Reader (Open)
- HUDI-6783 Add property interface support for key lookup reader (Open)

links to

GitHub Pull Request #9564

People

Assignee: Lin Liu

Reporter: Vinoth Chandar

Reviewers: Vinoth Chandar

Votes: 0

Watchers: 2

Dates

Created: 17/Aug/23 02:20

Updated: 06/Sep/24 17:41