Details
- Type: Epic
- Status: Closed
- Priority: Blocker
- Resolution: Done
- Epic Name: Read-Path-Rewrite
Description
Currently, our read-path infrastructure is largely separate for each Query Engine, so the same flow is replicated multiple times:
- Hive leverages hierarchy based off `InputFormat` class
- Spark leverages hierarchy based off `SnapshotRelation`
As a result, virtually the same flows are duplicated across engines, and the copies are now diverging because their lifecycles (bug fixes, etc.) are out of sync.
Proposal
Phase 1: Abstracting Common Functionality
T-shirt: 1-1.5 weeks
Goal: Abstract the following common items to avoid duplicating the complex sequences across Engines
* Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, `RealtimeUnmergedRecordReader`)
  - These Readers should differ only in the way they handle the payload; everything else should remain constant
* Abstract w/in a common component (name TBD)
  - Listing current file-slices at the requested instant (handling the timeline)
  - Creating a Record Iterator for the provided file-slice
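A minimal sketch of how the Phase 1 split could look, for illustration only. Every name here (`ReadPathSketch`, `FileSlice`, `PayloadHandler`, `listFileSlices`, `recordIterator`) is hypothetical and not an existing Hudi class or API. The sketch assumes instants are strings that sort lexicographically and that records are simple key/value pairs. The two shared steps live in one place, and the merged and unmerged readers differ only in the payload handler they pass in.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are actual Hudi classes.
final class ReadPathSketch {

    /** A toy key/value record. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    /** A toy file-slice: base-file records plus log records, written at an instant. */
    static final class FileSlice {
        final String fileId;
        final String instant; // assumed to sort lexicographically
        final List<Rec> base;
        final List<Rec> log;
        FileSlice(String fileId, String instant, List<Rec> base, List<Rec> log) {
            this.fileId = fileId; this.instant = instant; this.base = base; this.log = log;
        }
    }

    /** The only reader-specific part: how log payloads are combined with base records. */
    interface PayloadHandler {
        List<Rec> apply(List<Rec> base, List<Rec> log);
    }

    /** Compacted-style handling: a log record replaces the base record with the same key. */
    static final PayloadHandler MERGED = (base, log) -> {
        Map<String, Rec> byKey = new LinkedHashMap<>();
        for (Rec r : base) byKey.put(r.key, r);
        for (Rec r : log) byKey.put(r.key, r);
        return new ArrayList<>(byKey.values());
    };

    /** Unmerged-style handling: base and log records are emitted as-is. */
    static final PayloadHandler UNMERGED = (base, log) -> {
        List<Rec> out = new ArrayList<>(base);
        out.addAll(log);
        return out;
    };

    /** Shared step 1: latest slice per file id that is not newer than the requested instant. */
    static List<FileSlice> listFileSlices(List<FileSlice> all, String instant) {
        Map<String, FileSlice> latest = new LinkedHashMap<>();
        for (FileSlice s : all) {
            if (s.instant.compareTo(instant) > 0) continue; // written after the requested instant
            FileSlice cur = latest.get(s.fileId);
            if (cur == null || cur.instant.compareTo(s.instant) < 0) latest.put(s.fileId, s);
        }
        return new ArrayList<>(latest.values());
    }

    /** Shared step 2: record iterator for one slice; only the handler varies. */
    static Iterator<Rec> recordIterator(FileSlice slice, PayloadHandler handler) {
        return handler.apply(slice.base, slice.log).iterator();
    }

    /** Runs both shared steps and renders each record as "key=value". */
    static List<String> read(List<FileSlice> all, String instant, PayloadHandler handler) {
        List<String> out = new ArrayList<>();
        for (FileSlice s : listFileSlices(all, instant)) {
            Iterator<Rec> it = recordIterator(s, handler);
            while (it.hasNext()) {
                Rec r = it.next();
                out.add(r.key + "=" + r.value);
            }
        }
        return out;
    }

    /** Two versions of the same file group, written at instants "001" and "003". */
    static List<FileSlice> sampleSlices() {
        return List.of(
            new FileSlice("f1", "001",
                List.of(new Rec("a", "1"), new Rec("b", "2")), List.of(new Rec("a", "10"))),
            new FileSlice("f1", "003",
                List.of(new Rec("a", "10"), new Rec("b", "2")), List.of(new Rec("b", "20"))));
    }

    public static void main(String[] args) {
        System.out.println(read(sampleSlices(), "002", MERGED));   // [a=10, b=2]
        System.out.println(read(sampleSlices(), "002", UNMERGED)); // [a=1, b=2, a=10]
    }
}
```

Reading at instant "002" selects only the slice written at "001". The merged handler then collapses the update to key `a`, while the unmerged handler emits both versions. In this shape, a Hive `InputFormat` or a Spark relation would call the same two shared steps and supply only its own handler.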
REF
Issue Links
- blocks
  - HUDI-2762 Ensure hive can query insert only logs in MOR (Reopened)
- is blocked by
  - HUDI-3279 Metadata table stores incorrect file sizes after Restore (Closed)
- is duplicated by
  - HUDI-3082 [Phase 1] Unify MOR table access across Spark, Hive (Closed)
  - HUDI-2816 Unify file listing method of Spark/Flink/Hive (Closed)
- split to
  - HUDI-3247 Support incremental queries in AbstractHoodieTableFileIndex (Open)