Details
Type: Task
Status: Closed
Priority: Major
Resolution: Done
Description
Hudi writes metadata table (MT) base files in HFile format. An HFile stores sorted key-value pairs. For the existing MT partitions, each key is guaranteed to be unique. For a secondary index, however, the same value of the indexed field is very likely to appear in multiple files, so a key can map to more than one file.
This ticket is to microbenchmark two approaches to storing the secondary index:
- Group all values for a key and store one key-value pair per key, where the value is a collection. For example, say column c1 is the secondary index column, with value v1 in files f1 and f2 and value v2 in file f2. With this approach there are just 2 keys: i) v1: [f1, f2] and ii) v2: [f2].
- Since each key-value pair is unique as a whole, store each pair as a separate entry (still lexicographically sorted). With this approach there are 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
The benchmark should capture the storage overhead and lookup latency of each approach relative to the other.
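The two layouts can be illustrated with a minimal in-memory sketch. It uses plain sorted Python lists in place of an HFile, and the names `build_grouped`, `build_flat`, `lookup_grouped` and `lookup_flat` are hypothetical, not Hudi APIs. It only shows how the entries and lookups differ; it says nothing about actual storage size or latency.

```python
from bisect import bisect_left
from collections import defaultdict

# Example from the description: v1 is in files f1 and f2, v2 is in file f2.
pairs = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def build_grouped(pairs):
    """Approach 1: one entry per key; the value is a sorted list of files."""
    grouped = defaultdict(set)
    for key, file_id in pairs:
        grouped[key].add(file_id)
    return sorted((key, sorted(files)) for key, files in grouped.items())


def build_flat(pairs):
    """Approach 2: one entry per (key, file) pair, sorted lexicographically."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Binary-search to the single entry for key and return its file list."""
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Binary-search to the first entry for key, then scan while the key matches."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


grouped = build_grouped(pairs)  # 2 entries
flat = build_flat(pairs)        # 3 entries
print(len(grouped), lookup_grouped(grouped, "v1"))
print(len(flat), lookup_flat(flat, "v1"))
```

The grouped layout answers a lookup with one seek but has to rewrite the whole collection when a file is added for a key. The flat layout repeats the key in every entry and needs a short scan after the seek. A real microbenchmark would write actual HFiles and measure file size and read latency for both layouts.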