[HUDI-7145] Support for grouping values for same key in HFile - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.0.0-beta2, 1.0.0
Component/s: None
Labels:
- hudi-1.0.0-beta2

Story Points: 6
Epic Link: secondary index

Description

Hudi writes metadata table (MT) base files in HFile format, which stores sorted key-value pairs. For the existing MT partitions, each key is guaranteed to be unique. For a secondary index, however, the same value of the indexed field is very likely to appear in multiple files, so a single key can map to more than one value.

This ticket is to microbenchmark two approaches to storing the secondary index:

- Group all values for a key and store key-value pairs in which each value is a collection. For example, say column c1 is the secondary index column, with value v1 in files f1 and f2 and value v2 in file f2. With this approach there are still just 2 keys: i) v1: [f1, f2] and ii) v2: [f2].
- Since each key-value pair is unique as a whole, store each key-value pair separately (still lexicographically sorted). With this approach there are 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
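
The two layouts can be sketched with in-memory sorted lists standing in for HFile entries. This is only an illustration under stated assumptions, not Hudi's implementation: the function names (`build_grouped`, `build_flat`, `lookup_grouped`, `lookup_flat`) are hypothetical, and a real HFile lookup goes through block indexes rather than a Python list.

```python
from bisect import bisect_left
from collections import defaultdict

# Sample data from the example above: (secondary key value, file) pairs.
records = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def build_grouped(pairs):
    """Approach 1: one sorted entry per key; the value is a list of files."""
    grouped = defaultdict(list)
    for key, file_id in pairs:
        grouped[key].append(file_id)
    return sorted((key, sorted(files)) for key, files in grouped.items())


def build_flat(pairs):
    """Approach 2: one sorted entry per unique (key, file) pair."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Point lookup: binary search for the single entry holding this key."""
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Prefix scan: seek to the first entry for this key, then read forward."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


grouped_entries = build_grouped(records)  # 2 entries
flat_entries = build_flat(records)        # 3 entries
```

Both lookups return the same files for a key; the difference is that the grouped layout reads one entry while the flat layout seeks once and then scans every adjacent entry sharing the key.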

The benchmark should capture the storage overhead and lookup latency of each approach relative to the other.
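
A rough harness for the two measurements might look like the following. It is a sketch under loud assumptions: the workload generator, the byte-counting model (raw key and value bytes only, ignoring HFile block indexes, per-cell metadata, encoding and compression) and all function names are hypothetical, and an actual benchmark would write and read real HFiles.

```python
import time
from bisect import bisect_left


def make_pairs(num_keys, files_per_key):
    """Synthetic workload (assumption): every key appears in the same number of files."""
    return [(f"v{k:06d}", f"f{j:04d}")
            for k in range(num_keys) for j in range(files_per_key)]


def to_grouped(pairs):
    """Grouped layout: sorted (key, [files]) entries."""
    grouped = {}
    for key, file_id in sorted(pairs):
        grouped.setdefault(key, []).append(file_id)
    return sorted(grouped.items())


def raw_bytes_grouped(pairs):
    """Raw bytes if each distinct key is written once, followed by its files."""
    keys = {key for key, _ in pairs}
    return sum(len(k) for k in keys) + sum(len(f) for _, f in pairs)


def raw_bytes_flat(pairs):
    """Raw bytes if the key is repeated in every (key, file) entry."""
    return sum(len(k) + len(f) for k, f in pairs)


def lookup_grouped(entries, key):
    """Binary search for the single entry holding this key."""
    i = bisect_left(entries, (key,))
    return entries[i][1] if i < len(entries) and entries[i][0] == key else []


def lookup_flat(entries, key):
    """Seek to the first entry for this key, then scan forward."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


def time_lookups(lookup, entries, keys, repeats=3):
    """Best wall-clock time in seconds to look up every key once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for key in keys:
            lookup(entries, key)
        best = min(best, time.perf_counter() - start)
    return best
```

Comparing `raw_bytes_flat(pairs)` with `raw_bytes_grouped(pairs)` gives the cost of repeating the key, and calling `time_lookups` with each lookup function over the same key set gives a first-order latency comparison. Varying `files_per_key` shows how both metrics change with key cardinality.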

People

Assignee: Sagar Sumit

Reporter: Sagar Sumit

Votes: 0

Watchers: 1

Dates

Created: 27/Nov/23 17:11

Updated: 10/Jun/24 17:56

Resolved: 04/Apr/24 01:45