Apache Hudi / HUDI-3217

RFC-46: Optimize Record Payload handling

Details

    • RFC-46: Engine Native Record Payloads

    Description

      These are the gaps that we need to fill for the new record merging API

      • [P0] HUDI-6702 Extend merge API to support all merging operations (inserts, updates and deletes, including customized getInsertValue)
        • Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)
      • [P0] HUDI-6765 Add merge mode to allow differentiation of dedup logic
        • Add a merge-mode argument (pre-combine or update) to the merge API for customized dedup (or merging of log records?), instead of using OperationModeAwareness
      • [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion
        • HoodieRecordCompatibilityInterface provides adaptation among any representation type (Avro, Row, etc.)
        • Guarantee one type end-to-end: Avro, or Row for Spark (RowData for Flink). Avro log blocks then need a conversion from Avro to Row for Spark
      • [P0] HUDI-6768 Revisit HoodieRecord design and how it affects e2e row writing
        • HoodieRecord does not merely wrap engine-specific data structure; it also contains Java objects to store record key, location, etc.
        • For end-to-end row writing, could we just use engine-specific type InternalRow instead of HoodieRecord<InternalRow> by appending key, location, etc. as row fields, to better leverage Spark's optimization on DataFrame with InternalRow?
      • [P0] Bug fixes
        • HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
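
To make the shape of the extended merge call (HUDI-6702) and the merge-mode argument (HUDI-6765) concrete, here is a minimal, self-contained sketch. It does not use Hudi classes: `SketchRecord`, `Merged`, `MergeMode` and `LatestWinsMerger` are hypothetical stand-ins, `java.util.Optional` replaces Hudi's `Option`, a plain `String` replaces `Schema`, and `java.util.Properties` replaces `TypedProperties`. The "latest ordering value wins" rule and the way deletes are handled per mode are assumptions chosen for illustration, not the behavior the ticket specifies.

```java
import java.util.Optional;
import java.util.Properties;

/** Hypothetical stand-in for HoodieRecord: a key, an ordering value, and a delete flag. */
final class SketchRecord {
    final String key;
    final long orderingValue;
    final boolean isDelete;

    SketchRecord(String key, long orderingValue, boolean isDelete) {
        this.key = key;
        this.orderingValue = orderingValue;
        this.isDelete = isDelete;
    }
}

/** Hypothetical stand-in for Pair<HoodieRecord, Schema>. */
final class Merged {
    final SketchRecord record;
    final String schema;

    Merged(SketchRecord record, String schema) {
        this.record = record;
        this.schema = schema;
    }
}

/** Hypothetical merge mode: dedup within an incoming batch vs. update against storage. */
enum MergeMode { PRE_COMBINE, UPDATE }

final class LatestWinsMerger {
    /**
     * Mirrors the shape of the proposed merge(...) call with an extra mode argument.
     * An empty `older` models an insert; an empty result models a delete.
     * `props` is unused here and is kept only to mirror the proposed signature.
     */
    static Optional<Merged> merge(
            Optional<SketchRecord> older, String oldSchema,
            Optional<SketchRecord> newer, String newSchema,
            MergeMode mode, Properties props) {
        if (!newer.isPresent()) {
            // Nothing incoming: keep whatever was there before.
            return older.map(r -> new Merged(r, oldSchema));
        }
        SketchRecord incoming = newer.get();
        if (older.isPresent() && older.get().orderingValue > incoming.orderingValue) {
            // The stored record is more recent, so the incoming one is ignored.
            return Optional.of(new Merged(older.get(), oldSchema));
        }
        if (incoming.isDelete && mode == MergeMode.UPDATE) {
            // Assumption: a delete is only resolved when merging against storage.
            return Optional.empty();
        }
        // Insert, update, or (in PRE_COMBINE) a delete marker carried forward.
        return Optional.of(new Merged(incoming, newSchema));
    }
}
```

In this sketch a single entry point covers inserts (empty `older`), updates, and deletes (empty result), which is the consolidation the first gap asks for; whether the mode should be an argument or part of the merger configuration is left open here.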

      These are nice-to-haves but not on the critical path

      • [P1] Make merge logic engine-agnostic
        • Different engines currently need to implement the merging logic against their engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.), each in a different HoodieRecordMerger implementation class. Providing a getField API on HoodieRecord could allow engine-agnostic merge logic.
      • [P1] HUDI-5249, HUDI-5282 Implement MDT payload using new merge API
        • Only necessary if we use parquet as the base and log file format in MDT
      • [P1] HUDI-3354 Existing engine-specific readers to use HoodieRecord
        • As we will implement new file-group readers and writers, we do not need to fix existing readers now
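
The getField idea from the first nice-to-have can be sketched as follows: if every record exposes field access by name, one merger can serve records backed by different layouts. Everything here is hypothetical and independent of Hudi and of any engine: `FieldRecord`, `PositionalRecord`, `MapRecord` and `OrderingFieldMerger` are illustrative names, and the "larger ordering field wins" rule is an assumed merge policy.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical engine-neutral view of a record; only field access by name is exposed. */
interface FieldRecord {
    Object getField(String name);
}

/** Positional layout, loosely in the spirit of a row type that stores values by index. */
final class PositionalRecord implements FieldRecord {
    private final List<String> fieldNames;
    private final Object[] values;

    PositionalRecord(List<String> fieldNames, Object... values) {
        this.fieldNames = fieldNames;
        this.values = values;
    }

    @Override
    public Object getField(String name) {
        int idx = fieldNames.indexOf(name);
        return idx < 0 ? null : values[idx];
    }
}

/** Name-keyed layout, standing in for a second engine's representation. */
final class MapRecord implements FieldRecord {
    private final Map<String, Object> values;

    MapRecord(Map<String, Object> values) {
        this.values = values;
    }

    @Override
    public Object getField(String name) {
        return values.get(name);
    }
}

/** Merge logic written once against FieldRecord, with no knowledge of the layout. */
final class OrderingFieldMerger {
    /** Keeps the record with the larger ordering field; ties go to the newer record. */
    static FieldRecord merge(FieldRecord older, FieldRecord newer, String orderingField) {
        long oldTs = ((Number) older.getField(orderingField)).longValue();
        long newTs = ((Number) newer.getField(orderingField)).longValue();
        return newTs >= oldTs ? newer : older;
    }
}
```

The trade-off this sketch glosses over is cost: a by-name lookup per merge may be slower than a merger specialized for one row type, which is one reason the item is listed as a nice-to-have rather than a requirement.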

      — OLD PLAN —

      Currently Hudi is biased towards the assumption of a particular payload representation (Avro). Long-term, we would like to steer away from this and keep the record payload completely opaque, so that

      1. We can keep the record payload representation engine-specific
      2. We avoid unnecessary serde loops (Engine-specific > Avro > Engine-specific > Binary)

      Proposal

       
      Phase 2: Revisiting Record Handling
      T-shirt: 2-2.5 weeks
      Goal: Avoid tight coupling with particular record representation on the Read Path (currently Avro) and enable
      • Revisit RecordPayload APIs
        • Deprecate the getInsertValue and combineAndGetUpdateValue APIs, replacing them with new “opaque” APIs (not returning Avro payloads)
        • Rebase RecordPayload hierarchy to be engine-specific:
          • Common engine-specific base abstracting common functionality (Spark, Flink, Java)
          • Each feature-specific semantic will have to be implemented for all engines
        • Introduce new APIs
          • To access keys (record, partition)
          • To convert record to Avro (for BWC)
      • Revisit RecordPayload handling
        • In WriteHandles
          • API will be accepting opaque RecordPayload (no Avro conversion)
          • Can do (opaque) record merging if necessary
          • Passes the RecordPayload as-is to the FileWriter
        • In FileWriters
          • Will accept RecordPayload interface
          • Should be engine-specific (to handle internal record representation)
        • In RecordReaders
          • API will be providing opaque RecordPayload (no Avro conversion)
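
The hand-off described above (write handle accepts an opaque record, forwards it untouched, and only the engine-specific file writer looks inside) can be sketched as below. None of these types exist in Hudi under these names: `OpaqueRecord`, `RecordFileWriter`, `SketchWriteHandle`, `ArrayRowRecord` and `InMemoryArrayWriter` are hypothetical, and the toy "engine" simply uses `Object[]` as its native row.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical opaque record: exposes keys, hides the engine-native representation T. */
interface OpaqueRecord<T> {
    String recordKey();
    String partitionPath();
    T unwrap(); // intended for engine-specific code only
}

/** Hypothetical engine-specific writer; the only layer that unwraps the record. */
interface RecordFileWriter<T> {
    void write(OpaqueRecord<T> record);
}

/** Write handle that inspects keys only and forwards the record as-is. */
final class SketchWriteHandle<T> {
    private final String partitionPath;
    private final RecordFileWriter<T> writer;

    SketchWriteHandle(String partitionPath, RecordFileWriter<T> writer) {
        this.partitionPath = partitionPath;
        this.writer = writer;
    }

    /** Returns false when the record belongs to another partition and is not written. */
    boolean write(OpaqueRecord<T> record) {
        if (!partitionPath.equals(record.partitionPath())) {
            return false;
        }
        writer.write(record); // no conversion to an intermediate format
        return true;
    }
}

/** Toy engine record whose native row is an Object[]. */
final class ArrayRowRecord implements OpaqueRecord<Object[]> {
    private final String key;
    private final String partition;
    private final Object[] row;

    ArrayRowRecord(String key, String partition, Object[] row) {
        this.key = key;
        this.partition = partition;
        this.row = row;
    }

    @Override public String recordKey() { return key; }
    @Override public String partitionPath() { return partition; }
    @Override public Object[] unwrap() { return row; }
}

/** Toy engine-specific writer that collects native rows in memory. */
final class InMemoryArrayWriter implements RecordFileWriter<Object[]> {
    final List<Object[]> rows = new ArrayList<>();

    @Override
    public void write(OpaqueRecord<Object[]> record) {
        rows.add(record.unwrap());
    }
}
```

Because the handle is generic in `T`, it never needs to know the row type, so the same handle code could in principle be shared across engines while each engine supplies its own writer.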

            People

              guoyihua Ethan Guo (this is the old account; please use "yihua")
              alexey.kudinkin Alexey Kudinkin