Details
- New Feature
- Status: Open
- Major
- Resolution: Unresolved
- None
- None
Description
Parquet performs poorly when looking up specific records by a single key column.
e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)
Let's implement a reader that is optimized for this pattern by scanning the least amount of data.
Requirements:
1. Must support multiple values for the same key.
2. Can assume the file is sorted by the key/lookup field.
3. Should handle keys that do not exist in the file.
4. Should leverage Parquet metadata (bloom filters, column index, ...) to minimize the amount of data read.
5. Must make the minimum number of RPC calls to cloud storage.
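The requirements above can be illustrated with a minimal in-memory sketch. It does not use the Parquet API: `Row`, `RowGroup`, `candidateGroups` and `lookup` are hypothetical names, and each `RowGroup` stands in for a Parquet row group whose min/max key statistics would normally come from the file footer. Bloom filters and the column index are not modeled here. The sketch only shows how a sorted file lets the reader skip row groups whose key range contains none of the requested keys, collect every value for a duplicated key, and return nothing for keys that are absent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative model only; not the Hudi or Parquet reader API. */
final class KeyedLookupSketch {

    /** Hypothetical record: a key plus an opaque payload. */
    static final class Row {
        final String key;
        final String value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    /** Hypothetical stand-in for a row group: non-empty, rows sorted by key. */
    static final class RowGroup {
        final List<Row> rows;
        RowGroup(Row... rows) { this.rows = Arrays.asList(rows); }
        String minKey() { return rows.get(0).key; }               // models the min statistic
        String maxKey() { return rows.get(rows.size() - 1).key; } // models the max statistic
    }

    /** Returns indices of row groups whose [min, max] range contains at least one requested key. */
    static List<Integer> candidateGroups(List<RowGroup> file, NavigableSet<String> keys) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < file.size(); i++) {
            RowGroup g = file.get(i);
            // Smallest requested key that is >= the group's minimum key.
            String firstKeyInRange = keys.ceiling(g.minKey());
            if (firstKeyInRange != null && firstKeyInRange.compareTo(g.maxKey()) <= 0) {
                candidates.add(i);
            }
        }
        return candidates;
    }

    /** Binary search: index of the first row whose key is >= the given key. */
    static int lowerBound(List<Row> rows, String key) {
        int lo = 0;
        int hi = rows.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid).key.compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Returns all values per found key; keys that do not exist are simply absent from the map. */
    static Map<String, List<String>> lookup(List<RowGroup> file, Set<String> keys) {
        NavigableSet<String> sortedKeys = new TreeSet<>(keys);
        Map<String, List<String>> result = new TreeMap<>();
        for (int i : candidateGroups(file, sortedKeys)) {
            RowGroup g = file.get(i);
            // Only probe keys that can fall inside this group's key range.
            for (String key : sortedKeys.subSet(g.minKey(), true, g.maxKey(), true)) {
                List<Row> rows = g.rows;
                // Duplicates are adjacent because the group is sorted by key.
                for (int j = lowerBound(rows, key); j < rows.size() && rows.get(j).key.equals(key); j++) {
                    result.computeIfAbsent(key, k -> new ArrayList<>()).add(rows.get(j).value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RowGroup> file = Arrays.asList(
            new RowGroup(new Row("a", "1"), new Row("b", "2"), new Row("b", "3")),
            new RowGroup(new Row("b", "4"), new Row("d", "5")),
            new RowGroup(new Row("e", "6"), new Row("f", "7")));
        System.out.println(lookup(file, new TreeSet<>(Arrays.asList("b", "c", "z"))));
    }
}
```

In this sample, key `b` spans two row groups and all three of its values are returned, while `c` and `z` are absent from the result and the third row group is never scanned. A real reader would presumably apply the same pruning at page level using the column index, and would batch the surviving byte ranges into as few storage requests as possible.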
Attachments
Issue Links
- Blocked
  - HUDI-6772 Handle missing index metadata for keyed lookup reader (Open)
  - HUDI-6770 Improve on Key Lookup Reader (Closed)
  - HUDI-6771 Support Bloom Filter in Keyed Lookup Reader (Patch Available)
- blocks
  - HUDI-6769 Integration test on Keyed Lookup Reader (Open)
  - HUDI-6783 Add property interface support for key lookup reader (Open)
- links to