Apache Hudi / HUDI-3081

Revisiting Read Path Infra across Query Engines


Details

    • Type: Epic
    • Status: Closed
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: reader-core
    • Labels: None
    • 0
    • Epic Name: Read-Path-Rewrite

    Description

      Currently, our read-path infrastructure is largely separate for each query engine, with the same flow replicated multiple times:

      • Hive relies on a class hierarchy based on `InputFormat`
      • Spark relies on a class hierarchy based on `SnapshotRelation`

      This leads to substantial duplication of virtually identical flows, which are now unfortunately diverging because their lifecycles (bug fixes, etc.) are out of sync.

      Proposal

       
      Phase 1: Abstracting Common Functionality
       
      T-shirt: 1-1.5 weeks
      Goal: Abstract the following common items to avoid duplicating complex sequences across engines
      • Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, `RealtimeUnmergedRecordReader`)
        • These readers should differ only in how they handle the payload; everything else should remain the same
      • Abstract within a common component (name TBD)
        • Listing the current file slices at the requested instant (handling the timeline)
        • Creating a record iterator for the provided file slice
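
      The two Phase 1 items could be sketched roughly as follows. This is only an illustration of the intended shape, not the actual design: `ReadPathSketch`, `Rec`, `FileSlice`, `PayloadHandler`, `listFileSlices` and `recordIterator` are all hypothetical names, records are reduced to key/value strings, and instants are assumed to be fixed-width timestamp strings that sort lexicographically. The point is that the merged and unmerged readers share everything except the payload handler.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every type and method name here is illustrative, not a Hudi API.
final class ReadPathSketch {

    /** One keyed record; the value stands in for the real payload. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    /** A base file plus its delta-log records, tagged with the instant that produced it. */
    static final class FileSlice {
        final String instant;
        final List<Rec> baseRecords;
        final List<Rec> logRecords;
        FileSlice(String instant, List<Rec> baseRecords, List<Rec> logRecords) {
            this.instant = instant;
            this.baseRecords = baseRecords;
            this.logRecords = logRecords;
        }
    }

    /** The single point where the two readers are allowed to differ. */
    interface PayloadHandler {
        List<Rec> combine(List<Rec> baseRecords, List<Rec> logRecords);
    }

    /** Merged view: a log record replaces the base record with the same key. */
    static final PayloadHandler MERGED = (base, log) -> {
        Map<String, Rec> byKey = new LinkedHashMap<>();
        for (Rec r : base) byKey.put(r.key, r);
        for (Rec r : log) byKey.put(r.key, r);
        return new ArrayList<>(byKey.values());
    };

    /** Unmerged view: base and log records are emitted side by side, without de-duplication. */
    static final PayloadHandler UNMERGED = (base, log) -> {
        List<Rec> all = new ArrayList<>(base);
        all.addAll(log);
        return all;
    };

    /** Shared step 1: keep only the slices visible at the requested instant. */
    static List<FileSlice> listFileSlices(List<FileSlice> timeline, String requestedInstant) {
        List<FileSlice> visible = new ArrayList<>();
        for (FileSlice slice : timeline) {
            // Assumption: instants are fixed-width timestamps, so string order is time order.
            if (slice.instant.compareTo(requestedInstant) <= 0) {
                visible.add(slice);
            }
        }
        return visible;
    }

    /** Shared step 2: build the record iterator; only the handler varies between readers. */
    static Iterator<Rec> recordIterator(FileSlice slice, PayloadHandler handler) {
        return handler.combine(slice.baseRecords, slice.logRecords).iterator();
    }
}
```

      With this shape, an engine-specific reader would only choose a handler, e.g. `recordIterator(slice, ReadPathSketch.MERGED)` for the compacted view or `recordIterator(slice, ReadPathSketch.UNMERGED)` for the unmerged one, while timeline handling and iterator construction live in one place.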

       

      REF

      https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-6680

            People

              Assignee: Alexey Kudinkin
              Reporter: Alexey Kudinkin