Details
Type: Epic
Status: In Progress
Priority: Critical
Resolution: Unresolved
Epic Name: RFC-46: Engine Native Record Payloads
Description
These are the gaps we need to fill for the new record merging API:
- [P0] HUDI-6702 Extend merge API to support all merging operations (inserts, updates, and deletes, including customized getInsertValue); see the sketch after this list
  - Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)
- [P0] HUDI-6765 Add merge mode to allow differentiation of dedup logic
  - Add a new merge-mode argument (pre-combine, or update) to the merge API for customized dedup (or merging of log records?), instead of using OperationModeAwareness
- [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion
  - HoodieRecordCompatibilityInterface provides adaptation among all representation types (Avro, Row, etc.)
  - Guarantee one type end-to-end: Avro, or Row for Spark (RowData for Flink). For an Avro log block, this requires converting from Avro to Row for Spark
- [P0] HUDI-6768 Revisit HoodieRecord design and how it affects e2e row writing
  - HoodieRecord does not merely wrap an engine-specific data structure; it also contains Java objects to store the record key, location, etc.
  - For end-to-end row writing, could we use the engine-specific type InternalRow directly, instead of HoodieRecord<InternalRow>, by appending key, location, etc. as row fields, to better leverage Spark's optimization of DataFrames backed by InternalRow?
- [P0] Bug fixes
  - HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
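To make the extended API concrete, here is a minimal sketch of a merger built on the proposed signature from HUDI-6702. The Option-typed inputs are what let a single method cover inserts (empty older), updates (both present), and deletes (return Option.empty()); the class name and overwrite-latest semantics are illustrative, not part of the proposal.

```java
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;

// Sketch of a merger against the proposed extended signature (HUDI-6702).
public class OverwriteLatestMergerSketch {
  public Option<Pair<HoodieRecord, Schema>> merge(
      Option<HoodieRecord> older, Schema oldSchema,
      Option<HoodieRecord> newer, Schema newSchema,
      TypedProperties props) {
    if (!newer.isPresent()) {
      // Nothing incoming: keep the existing record if there is one.
      return older.map(rec -> Pair.of(rec, oldSchema));
    }
    if (!older.isPresent()) {
      // Insert: no existing record, so the incoming one wins
      // (this subsumes the customized getInsertValue hook).
      return Option.of(Pair.of(newer.get(), newSchema));
    }
    // Update with overwrite-latest semantics; a custom merger could
    // instead combine fields, or return Option.empty() to emit a delete.
    return Option.of(Pair.of(newer.get(), newSchema));
  }
}
```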
These are nice-to-haves but not on the critical path:
- [P1] Make merge logic engine-agnostic
  - Today, each engine must implement the merging logic against its engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.) in a separate HoodieRecordMerger implementation class. Providing a getField API on HoodieRecord could allow engine-agnostic merge logic; see the sketch after this list
- [P1] HUDI-5249, HUDI-5282 Implement MDT payload using the new merge API
  - Only necessary if we use Parquet as both the base and log file format in the MDT
- [P1] HUDI-3354 Existing engine-specific readers to use HoodieRecord
  - Since we will implement new file-group readers and writers, we do not need to fix the existing readers now
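A small sketch of the engine-agnostic idea above. Note that getField does not exist on HoodieRecord today; it is the accessor this P1 item proposes, so the sketch defines its own minimal interface rather than calling a real Hudi API.

```java
// Hypothetical accessor from the P1 item above; not an existing Hudi API.
interface FieldAccessible {
  Object getField(String fieldName);
}

// One merge implementation usable by any engine whose record type
// implements the accessor, instead of one HoodieRecordMerger per engine.
final class EngineAgnosticMerge {
  @SuppressWarnings("unchecked")
  static <R extends FieldAccessible> R pickLatest(R older, R newer, String orderingField) {
    Comparable<Object> oldVal = (Comparable<Object>) older.getField(orderingField);
    Comparable<Object> newVal = (Comparable<Object>) newer.getField(orderingField);
    // Keep the record with the greater ordering value; ties go to the newer one.
    return newVal.compareTo(oldVal) >= 0 ? newer : older;
  }
}
```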
— OLD PLAN —
Currently Hudi is biased toward the assumption of a particular payload representation (Avro). Long-term, we would like to steer away from this and keep the record payload completely opaque, so that:
- We can keep record payload representation engine-specific
- Avoid unnecessary serde loops (engine-specific → Avro → engine-specific → binary)
Proposal
Phase 2: Revisiting Record Handling
T-shirt: 2-2.5 weeks
Goal: Avoid tight coupling with a particular record representation on the read path (currently Avro) and enable engine-native record payloads.
- Revisit RecordPayload APIs
  - Deprecate the getInsertValue and combineAndGetUpdateValue APIs, replacing them with new "opaque" APIs (not returning Avro payloads); see the sketch after this list
  - Rebase the RecordPayload hierarchy to be engine-specific:
    - A common engine-specific base abstracting shared functionality (Spark, Flink, Java)
    - Each feature-specific semantic will have to be implemented for all engines
  - Introduce new APIs
    - To access keys (record, partition)
    - To convert a record to Avro (for backward compatibility)
- Revisit RecordPayload handling
  - In WriteHandles
    - The API will accept an opaque RecordPayload (no Avro conversion)
    - Can do (opaque) record merging if necessary
    - Passes the RecordPayload as-is to the FileWriter
  - In FileWriters
    - Will accept the RecordPayload interface
    - Should be engine-specific (to handle the internal record representation)
  - In RecordReaders
    - The API will provide an opaque RecordPayload (no Avro conversion)
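For contrast, here is the shape of the APIs being deprecated next to a possible opaque replacement. The legacy signatures match HoodieRecordPayload; the replacement interface and its combine signature are a sketch of the direction, not the final design. The point is that the legacy methods force every record through Avro's IndexedRecord, while an opaque API lets each engine keep its native representation end-to-end.

```java
import java.io.IOException;
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

// Legacy shape (per HoodieRecordPayload): both calls produce Avro.
interface AvroCoupledPayload {
  Option<IndexedRecord> getInsertValue(Schema schema) throws IOException;
  Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException;
}

// Hypothetical opaque replacement: T is the engine-native type
// (InternalRow for Spark, RowData for Flink), so WriteHandles and
// FileWriters never see Avro. Returning Option.empty() means delete.
interface OpaquePayload<T> {
  Option<T> combine(Option<T> current, Properties props);
}
```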
Issues in epic
Key | Summary | Status | Assignee
HUDI-4358 | Standardize the order field (orderingVal/eventTime) of Hudi | Open | Jonathan Vexler
HUDI-6767 | Simplify compatibility of HoodieRecord conversion | Open | Lin Liu
HUDI-3908 | Profile MOR snapshot query flow | Closed | Unassigned
HUDI-3291 | Flip default record payload to DefaultHoodieRecordPayload | In Progress | sivabalan narayanan
HUDI-835 | Refactor HoodieMergeHandle into factory pattern | Closed | satish
HUDI-3381 | Rebase `HoodieMergeHandle` to operate on `HoodieRecord` | Closed | Frank Wong
HUDI-3349 | Revisit HoodieRecord API to be able to replace HoodieRecordPayload | Closed | Frank Wong
HUDI-3354 | Rebase `HoodieRealtimeRecordReader` to return `HoodieRecord` | Open | Alexey Kudinkin
HUDI-3380 | Rebase `HoodieDataBlock`s to operate on `HoodieRecord` | Closed | Frank Wong
HUDI-2598 | Redesign record payload class to decouple HoodieRecordPayload from Avro | Closed | Frank Wong
HUDI-3238 | Survey usages of RecordPayload | Closed | Alexey Kudinkin
HUDI-3318 | Write RFC regarding proposed changes to the RecordPayload hierarchy | Closed | Alexey Kudinkin
HUDI-3350 | Create engine-specific implementations of `HoodieRecord` | Closed | Frank Wong
HUDI-3351 | Rebase record combining semantic into `HoodieRecordCombiningEngine` | Closed | Frank Wong
HUDI-3353 | Rebase `HoodieFileWriter` to accept `HoodieRecord` | Closed | XiaoyuGeng
HUDI-3378 | Rebase `HoodieCreateHandle` to operate on `HoodieRecord` | Closed | Frank Wong
HUDI-3379 | Rebase `HoodieAppendHandle` to operate on `HoodieRecord` | Closed | Frank Wong
HUDI-3384 | Implement Spark-specific FileWriters | Closed | XiaoyuGeng
HUDI-3385 | Implement Spark-specific `FileReader`s | Closed | XiaoyuGeng
HUDI-3410 | Revisit record-reading abstractions | Closed | Frank Wong
HUDI-4321 | Fix Hudi to not write in Parquet legacy format | Open | Unassigned
HUDI-4380 | Name of the record merge API | Closed | Frank Wong
HUDI-5417 | Support reading Avro from non-legacy map/list in Parquet log | Closed | Frank Wong
HUDI-5633 | Fix HoodieSparkRecord performance bottlenecks | Closed | Alexey Kudinkin
HUDI-4988 | Add docs regarding Hudi RecordMerger | Closed | Frank Wong
HUDI-5019 | Remove unnecessary newInstance invocations | Closed | Danny Chen
HUDI-5249 | Support MetadataColumnStatsIndex for Spark record | Closed | Lin Liu
HUDI-5264 | Test Parquet log with Avro record in Spark SQL test | Open | Lin Liu
HUDI-5281 | Rewrite HoodieSparkRecord with UnsafeRowWriter | Closed | Unassigned
HUDI-5282 | Support Metadata in HoodieSparkRecord | Open | Unassigned
HUDI-6702 | Extend merge API to support all merging operations | Closed | Lin Liu
HUDI-6751 | Scope out remaining work for the record merging API | Closed | Ethan Guo
HUDI-6784 | Clean Merger API and its invocations | Closed | Lin Liu
HUDI-6810 | [RFC-46] Update merger API to support optional parameters | Closed | Lin Liu
HUDI-6811 | Deprecate HoodieRecordPayload | Open | Lin Liu
HUDI-6837 | Ensure getInsertValue is wrapped correctly | Closed | Unassigned
HUDI-6907 | E2E support HoodieSparkRecord | Closed | Lin Liu
HUDI-6768 | Revisit HoodieRecord design and how it affects e2e row writing | Open | Ethan Guo
HUDI-7678 | Finalize the Merger APIs and plan the migration of all existing built-in and custom payloads | In Progress | Y Ethan Guo
HUDI-6765 | Add merge mode to allow differentiation of dedup logic | Closed | Lin Liu
HUDI-5807 | HoodieSparkParquetReader is not appending partition-path values | Closed | Jonathan Vexler