Apache Hudi / HUDI-6712

Implement optimized keyed lookup on parquet files


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 1.1.0
    • None

    Description

      Parquet performs poorly when looking up specific records by the values of a single key column.

      e.g: select * from parquet where key in ("a", "b", "c") (SQL)
      e.g: List<Records> lookup(parquetFile, Set<String> keys) (code) 

      Let's implement a reader that is optimized for this pattern by scanning the least amount of data.
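
      To make the intended lookup contract concrete, here is a minimal in-memory sketch. It assumes the records are already sorted by key and models the file as a plain list; `SortedKeyLookup` and `Rec` are hypothetical names for illustration, not Hudi or Parquet APIs. It returns every record for a duplicated key and returns nothing for a key that is absent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: models a key-sorted file as an in-memory list.
final class SortedKeyLookup {

    // A record reduced to its lookup key and an opaque payload.
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Index of the first record at or after 'from' whose key is >= target,
    // or sorted.size() if every remaining key is smaller (binary search).
    static int lowerBound(List<Rec> sorted, String target, int from) {
        int lo = from;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).key.compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns every record matching any requested key, in key order.
    static List<Rec> lookup(List<Rec> sorted, Set<String> keys) {
        List<Rec> out = new ArrayList<>();
        int from = 0;
        for (String k : new TreeSet<>(keys)) {      // visit keys in sorted order
            int i = lowerBound(sorted, k, from);
            // Collect all duplicates of k; a missing key adds nothing.
            while (i < sorted.size() && sorted.get(i).key.equals(k)) {
                out.add(sorted.get(i++));
            }
            from = i;   // larger keys cannot appear before this position
        }
        return out;
    }
}
```

      Visiting the requested keys in sorted order lets each binary search start where the previous one ended, so the data is walked forward only once.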

      Requirements: 
      1. Need to support multiple values for the same key. 
      2. Can assume the file is sorted by the key/lookup field. 
      3. Should handle non-existence of keys.
      4. Should leverage Parquet metadata (bloom filters, column index, ...) to minimize the amount of data read. 
      5. Must make the minimum number of RPC calls to cloud storage.
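
      One possible way to approach requirements 4 and 5 is sketched below. It assumes per-page min/max key statistics are available (similar in spirit to what a Parquet column index records) and that pages are listed in file offset order. `PagePlanner`, `Page`, `Range`, and the `maxGap` parameter are hypothetical names for illustration, not the parquet-java API. Pages whose key interval contains no requested key are skipped, and the byte ranges of the remaining pages are merged so that nearby pages can be fetched in a single request. Bloom filters are not modeled here; they could serve as an additional check to rule out absent keys before any page is read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical model of page-level statistics; not the parquet-java API.
final class PagePlanner {

    static final class Page {
        final String minKey;
        final String maxKey;
        final long offset;   // start of the page's bytes in the file
        final long length;   // number of bytes in the page

        Page(String minKey, String maxKey, long offset, long length) {
            this.minKey = minKey;
            this.maxKey = maxKey;
            this.offset = offset;
            this.length = length;
        }
    }

    static final class Range {
        final long start;    // inclusive
        final long end;      // exclusive

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Keep only pages whose [minKey, maxKey] interval contains a requested key.
    static List<Page> selectPages(List<Page> pages, SortedSet<String> keys) {
        List<Page> out = new ArrayList<>();
        for (Page p : pages) {
            // Requested keys that are >= the page minimum.
            SortedSet<String> tail = keys.tailSet(p.minKey);
            if (!tail.isEmpty() && tail.first().compareTo(p.maxKey) <= 0) {
                out.add(p);
            }
        }
        return out;
    }

    // Merge byte ranges separated by at most maxGap bytes, trading a few
    // extra bytes read for fewer storage requests.
    static List<Range> coalesce(List<Page> selected, long maxGap) {
        List<Range> out = new ArrayList<>();
        long start = -1;
        long end = -1;
        for (Page p : selected) {   // assumed to be in file offset order
            if (start >= 0 && p.offset - end <= maxGap) {
                end = Math.max(end, p.offset + p.length);
            } else {
                if (start >= 0) {
                    out.add(new Range(start, end));
                }
                start = p.offset;
                end = p.offset + p.length;
            }
        }
        if (start >= 0) {
            out.add(new Range(start, end));
        }
        return out;
    }
}
```

      Because a key can repeat across a page boundary, a page whose maximum equals a requested key and the next page whose minimum equals the same key are both selected, which keeps requirement 1 intact.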


            People

              linliu Lin Liu
              vinoth Vinoth Chandar
