Apache Hudi / HUDI-3217

RFC-46: Optimize Record Payload handling


Details

    • RFC-46: Engine Native Record Payloads

    Description

      These are the gaps that we need to fill for the new record merging API:

      • [P0] HUDI-6702 Extend the merge API to support all merging operations (inserts, updates, and deletes, including customized getInsertValue); see the first sketch after this list
        • Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)
      • [P0] HUDI-6765 Add a merge mode to allow differentiation of dedup logic (also covered by the first sketch below)
        • Add a new merge-mode argument (pre-combine or update) to the merge API for customized dedup (or merging of log records?), instead of using OperationModeAwareness
      • [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion (see the second sketch after this list)
        • HoodieRecordCompatibilityInterface provides adaptation among all representation types (Avro, Row, etc.)
        • Guarantee one type end-to-end per engine: Avro, or Row for Spark (RowData for Flink). Reading an Avro log block in Spark then needs a conversion from Avro to Row
      • [P0] HUDI-6768 Revisit the HoodieRecord design and how it affects e2e row writing (see the third sketch after this list)
        • HoodieRecord does not merely wrap an engine-specific data structure; it also contains Java objects to store the record key, location, etc.
        • For end-to-end row writing, could we just use the engine-specific type InternalRow instead of HoodieRecord<InternalRow>, appending key, location, etc. as row fields, to better leverage Spark's optimizations on a DataFrame of InternalRow?
      • [P0] Bug fixes
        • HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
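
      Below are three illustrative sketches for the P0 items above, in Java. First, a minimal sketch of the extended merge API from HUDI-6702 combined with the merge mode from HUDI-6765; the MergeMode enum and the combined signature are assumptions for illustration, not a committed interface:

        import org.apache.avro.Schema;
        import org.apache.hudi.common.config.TypedProperties;
        import org.apache.hudi.common.model.HoodieRecord;
        import org.apache.hudi.common.util.Option;
        import org.apache.hudi.common.util.collection.Pair;

        // Hypothetical sketch of the extended merge API (HUDI-6702 + HUDI-6765).
        public interface ExtendedRecordMergerSketch {

          // Assumed merge modes differentiating dedup from update merging,
          // replacing the OperationModeAwareness mechanism.
          enum MergeMode {
            PRE_COMBINE,  // dedup among incoming records before write
            UPDATE        // merge an incoming record against the stored one
          }

          // Option.empty() on either side models inserts and deletes, so one API
          // also covers customized getInsertValue behavior; an empty result means
          // the record disappears (a delete).
          Option<Pair<HoodieRecord, Schema>> merge(
              Option<HoodieRecord> older, Schema oldSchema,
              Option<HoodieRecord> newer, Schema newSchema,
              MergeMode mode, TypedProperties props);
        }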
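
      Second, a sketch of the compatibility idea in HUDI-6767; the interface name and method below are assumptions, not the actual HoodieRecordCompatibilityInterface:

        import org.apache.avro.Schema;
        import org.apache.hudi.common.config.TypedProperties;
        import org.apache.hudi.common.model.HoodieRecord;

        // Hypothetical adaptation hook: rewrite a record into the engine-native
        // type (e.g. an Avro log-block record into a Spark row) so that a single
        // representation flows end-to-end per engine.
        public interface RecordTypeAdapterSketch {

          // The target representation is engine-specific: Avro IndexedRecord for
          // the Java path, InternalRow for Spark, RowData for Flink.
          HoodieRecord<?> rewriteToEngineType(Schema recordSchema, TypedProperties props);
        }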
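
      Third, a sketch of the HUDI-6768 idea of appending key and location as row fields instead of wrapping rows in HoodieRecord<InternalRow>; the two-column meta layout and the helper are assumptions:

        import org.apache.spark.sql.catalyst.InternalRow;
        import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
        import org.apache.spark.sql.catalyst.expressions.JoinedRow;
        import org.apache.spark.unsafe.types.UTF8String;

        // Keep the data as an InternalRow end-to-end by prepending meta columns
        // (record key, partition path) instead of boxing it into a HoodieRecord.
        public final class RowWithMetaSketch {

          public static InternalRow withMeta(String recordKey, String partitionPath,
                                             InternalRow data) {
            InternalRow meta = new GenericInternalRow(new Object[] {
                UTF8String.fromString(recordKey),
                UTF8String.fromString(partitionPath)
            });
            // JoinedRow concatenates the two rows without copying field values.
            return new JoinedRow(meta, data);
          }
        }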

      These are nice-to-haves, but not on the critical path:

      • [P1] Make merge logic engine-agnostic (see the sketch after this list)
        • Today, each engine needs a different HoodieRecordMerger implementation class that implements the merging logic against the engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.). Providing a getField API on HoodieRecord could allow engine-agnostic merge logic.
      • [P1] HUDI-5249, HUDI-5282 Implement the MDT payload using the new merge API
        • Only necessary if we use parquet as both the base and the log file format in MDT
      • [P1] HUDI-3354 Existing engine-specific readers to use HoodieRecord
        • As we will implement new file-group readers and writers, we do not need to fix the existing readers now
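
      To make the engine-agnostic idea concrete, here is a sketch of latest-wins merging written once against a generic field accessor; the accessor parameter stands in for the proposed getField API and is an assumption:

        import java.util.function.BiFunction;
        import org.apache.hudi.common.model.HoodieRecord;

        public final class EngineAgnosticMergeSketch {

          // fieldAccessor is a stand-in for a HoodieRecord#getField API; with it,
          // simple ordering-field semantics need no InternalRow-, RowData-, or
          // Avro-specific merger classes.
          public static <T, C extends Comparable<C>> HoodieRecord<T> latestWins(
              HoodieRecord<T> older, HoodieRecord<T> newer, String orderingField,
              BiFunction<HoodieRecord<T>, String, C> fieldAccessor) {
            C oldVal = fieldAccessor.apply(older, orderingField);
            C newVal = fieldAccessor.apply(newer, orderingField);
            return newVal.compareTo(oldVal) >= 0 ? newer : older;
          }
        }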

      — OLD PLAN —

      Currently, Hudi is biased toward the assumption of a particular payload representation (Avro). Long-term, we would like to steer away from this and keep the record payload completely opaque, so that

      1. We can keep the record payload representation engine-specific
      2. We avoid unnecessary serde loops (engine-specific > Avro > engine-specific > binary)

      Proposal

       
      Phase 2: Revisiting Record Handling
      T-shirt: 2-2.5 weeks
      Goal: Avoid tight coupling with a particular record representation on the read path (currently Avro) and enable engine-specific record representations:
      • Revisit RecordPayload APIs
        • Deprecate the getInsertValue and combineAndGetUpdateValue APIs, replacing them with new “opaque” APIs (not returning Avro payloads)
        • Rebase the RecordPayload hierarchy to be engine-specific:
          • A common engine-specific base abstracting shared functionality (Spark, Flink, Java)
          • Each feature-specific semantic will have to be implemented for all engines
        • Introduce new APIs (see the sketch after this plan)
          • To access keys (record, partition)
          • To convert the record to Avro (for backwards compatibility)
      • Revisit RecordPayload handling
        • In WriteHandles
          • The API will accept an opaque RecordPayload (no Avro conversion)
          • Can do (opaque) record merging if necessary
          • Passes the RecordPayload as-is to the FileWriter
        • In FileWriters
          • Will accept the RecordPayload interface
          • Should be engine-specific (to handle the internal record representation)
        • In RecordReaders
          • The API will provide an opaque RecordPayload (no Avro conversion)
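
      A sketch of what the “opaque” payload API described above could look like; all names below are illustrative, not actual Hudi interfaces:

        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericRecord;

        // Record contents stay engine-specific; keys are exposed directly, and
        // Avro is reachable only through an explicit backwards-compatibility hook.
        public interface OpaqueRecordPayloadSketch<T> {

          // Key access without deserializing the payload to Avro.
          String getRecordKey();
          String getPartitionPath();

          // The engine-native representation (InternalRow, RowData, GenericRecord, ...).
          T getData();

          // Backwards-compatibility escape hatch; new code paths should avoid it.
          GenericRecord toAvro(Schema schema);
        }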

       

       

          People

            Assignee: guoyihua Ethan Guo (this is the old account; please use "yihua")
            Reporter: alexey.kudinkin Alexey Kudinkin