  1. Apache Hudi
  2. HUDI-7145

Support for grouping values for same key in HFile


Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Fix Version/s: 1.0.0-beta2, 1.0.0

    Description

      Hudi writes the base files of the metadata table (MT) in HFile format. An HFile stores sorted key-value pairs. For the existing MT partitions, each key is guaranteed to be unique. For a secondary index, however, the same value of the indexed field is very likely to appear in multiple files.

      This ticket is to microbenchmark two approaches to storing the secondary index:

      1. Group all values for a key, then store key-value pairs in which each value is a collection. For example, say column c1 is the secondary index column, with value v1 in files f1 and f2 and value v2 in file f2. With this approach there are still just 2 keys: i) v1: [f1, f2] and ii) v2: [f2].
      2. Since each key-value pair is unique as a whole, store each key-value pair separately (still lexicographically sorted). With this approach there are 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
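
      The two layouts can be sketched in memory as follows. This is only an illustration using sorted Python lists in place of an HFile; the function names (build_grouped, build_flat, lookup_grouped, lookup_flat) are hypothetical and are not part of Hudi.

```python
from bisect import bisect_left
from collections import defaultdict

# (secondary index value, file) pairs from the example above.
PAIRS = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def build_grouped(pairs):
    """Approach 1: one sorted entry per key; the value is a list of files."""
    groups = defaultdict(set)
    for key, file_id in pairs:
        groups[key].add(file_id)
    return [(key, sorted(files)) for key, files in sorted(groups.items())]


def build_flat(pairs):
    """Approach 2: one sorted entry per (key, file) pair."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Binary-search for the single entry holding all files for `key`."""
    # (key,) sorts before any (key, value) tuple, so this finds the first match.
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Binary-search for the first entry of `key`, then scan its duplicates."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


if __name__ == "__main__":
    print(build_grouped(PAIRS))
    print(build_flat(PAIRS))
```

      Both lookups return the same set of files for a key; they differ in whether the files arrive as one value or as a run of adjacent entries.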

      The benchmark should capture storage overhead and lookup latency of one approach over the other.
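
      A rough harness for such a comparison might look like the sketch below. It is an in-memory proxy only: it does not write real HFiles, the length-prefixed byte encoding is an assumption made for illustration, and all names (make_flat, run_benchmark, and so on) are hypothetical. A real benchmark would need to go through Hudi's HFile writer and reader and account for block encoding and compression.

```python
import random
import time
from bisect import bisect_left
from itertools import groupby


def make_flat(num_keys, files_per_key, seed=0):
    """Approach 2 layout: sorted (key, file) pairs over synthetic data."""
    rng = random.Random(seed)
    files = [f"f{i:04d}" for i in range(files_per_key * 4)]
    return sorted(
        (f"v{k:06d}", f)
        for k in range(num_keys)
        for f in rng.sample(files, files_per_key)
    )


def make_grouped(flat):
    """Approach 1 layout: one (key, [files]) entry per distinct key."""
    return [(k, [f for _, f in grp]) for k, grp in groupby(flat, key=lambda p: p[0])]


def flat_bytes(flat):
    # Assumed encoding: 4-byte key length + key + 4-byte value length + value.
    return sum(8 + len(k.encode()) + len(f.encode()) for k, f in flat)


def grouped_bytes(grouped):
    # Assumed encoding: 4-byte key length + key + 4-byte file count,
    # then a 4-byte length + bytes for each file name.
    return sum(
        8 + len(k.encode()) + sum(4 + len(f.encode()) for f in files)
        for k, files in grouped
    )


def lookup_flat(flat, key):
    i = bisect_left(flat, (key,))
    out = []
    while i < len(flat) and flat[i][0] == key:
        out.append(flat[i][1])
        i += 1
    return out


def lookup_grouped(grouped, key):
    i = bisect_left(grouped, (key,))
    if i < len(grouped) and grouped[i][0] == key:
        return grouped[i][1]
    return []


def mean_latency(lookup, entries, probes):
    """Average wall-clock seconds per lookup over the probe keys."""
    start = time.perf_counter()
    for key in probes:
        lookup(entries, key)
    return (time.perf_counter() - start) / len(probes)


def run_benchmark(num_keys=10_000, files_per_key=3, num_probes=1_000, seed=0):
    flat = make_flat(num_keys, files_per_key, seed)
    grouped = make_grouped(flat)
    rng = random.Random(seed + 1)
    probes = [f"v{rng.randrange(num_keys):06d}" for _ in range(num_probes)]
    return {
        "flat_bytes": flat_bytes(flat),
        "grouped_bytes": grouped_bytes(grouped),
        "flat_latency_s": mean_latency(lookup_flat, flat, probes),
        "grouped_latency_s": mean_latency(lookup_grouped, grouped, probes),
    }


if __name__ == "__main__":
    for name, value in run_benchmark().items():
        print(f"{name}: {value}")
```

      Under this assumed encoding the grouped layout is smaller whenever a key maps to more than one file, because the key bytes are written once per key rather than once per pair. The latency numbers depend on the machine and say nothing about HFile block reads, so they should be treated as a sanity check rather than a result.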

       

          People

            Assignee: Sagar Sumit (codope)
            Reporter: Sagar Sumit (codope)
