Apache Hudi / HUDI-3217

RFC-46: Optimize Record Payload handling


Details

    • RFC-46: Engine Native Record Payloads

    Description

      These are the gaps that we need to fill for the new record merging API:

      • [P0] HUDI-6702 Extend the merge API to support all merging operations (inserts, updates, and deletes, including customized getInsertValue); see the first sketch after this list
        • Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)
      • [P0] HUDI-6765 Add a merge mode to allow differentiation of dedup logic (also covered by the first sketch below)
        • Add a new merge-mode argument (pre-combine or update) to the merge API for customized dedup (or merging of log records?), instead of using OperationModeAwareness
      • [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion (see the second sketch after this list)
        • HoodieRecordCompatibilityInterface provides adaptation among all representation types (Avro, Row, etc.)
        • Guarantee one type end-to-end per engine: Avro, or Row for Spark (RowData for Flink). Reading an Avro log block in Spark then needs a conversion from Avro to Row
      • [P0] HUDI-6768 Revisit the HoodieRecord design and how it affects e2e row writing (see the third sketch after this list)
        • HoodieRecord does not merely wrap an engine-specific data structure; it also contains Java objects to store the record key, location, etc.
        • For end-to-end row writing, could we just use the engine-specific type InternalRow instead of HoodieRecord<InternalRow>, appending key, location, etc. as row fields, to better leverage Spark's optimizations on a DataFrame of InternalRow?
      • [P0] Bug fixes
        • HUDI-5807 HoodieSparkParquetReader is not appending partition-path values
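
      Below are three illustrative sketches for the P0 items above, in Java. First, a minimal sketch of the extended merge API from HUDI-6702 combined with the merge mode from HUDI-6765; the MergeMode enum and the combined signature are assumptions for illustration, not a committed interface:

        import org.apache.avro.Schema;
        import org.apache.hudi.common.config.TypedProperties;
        import org.apache.hudi.common.model.HoodieRecord;
        import org.apache.hudi.common.util.Option;
        import org.apache.hudi.common.util.collection.Pair;

        // Hypothetical sketch of the extended merge API (HUDI-6702 + HUDI-6765).
        public interface ExtendedRecordMergerSketch {

          // Assumed merge modes differentiating dedup from update merging,
          // replacing the OperationModeAwareness mechanism.
          enum MergeMode {
            PRE_COMBINE,  // dedup among incoming records before write
            UPDATE        // merge an incoming record against the stored one
          }

          // Option.empty() on either side models inserts and deletes, so one API
          // also covers customized getInsertValue behavior; an empty result means
          // the record disappears (a delete).
          Option<Pair<HoodieRecord, Schema>> merge(
              Option<HoodieRecord> older, Schema oldSchema,
              Option<HoodieRecord> newer, Schema newSchema,
              MergeMode mode, TypedProperties props);
        }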
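
      Second, a sketch of the compatibility idea in HUDI-6767; the interface name and method below are assumptions, not the actual HoodieRecordCompatibilityInterface:

        import org.apache.avro.Schema;
        import org.apache.hudi.common.config.TypedProperties;
        import org.apache.hudi.common.model.HoodieRecord;

        // Hypothetical adaptation hook: rewrite a record into the engine-native
        // type (e.g. an Avro log-block record into a Spark row) so that a single
        // representation flows end-to-end per engine.
        public interface RecordTypeAdapterSketch {

          // The target representation is engine-specific: Avro IndexedRecord for
          // the Java path, InternalRow for Spark, RowData for Flink.
          HoodieRecord<?> rewriteToEngineType(Schema recordSchema, TypedProperties props);
        }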
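
      Third, a sketch of the HUDI-6768 idea of appending key and location as row fields instead of wrapping rows in HoodieRecord<InternalRow>; the two-column meta layout and the helper are assumptions:

        import org.apache.spark.sql.catalyst.InternalRow;
        import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
        import org.apache.spark.sql.catalyst.expressions.JoinedRow;
        import org.apache.spark.unsafe.types.UTF8String;

        // Keep the data as an InternalRow end-to-end by prepending meta columns
        // (record key, partition path) instead of boxing it into a HoodieRecord.
        public final class RowWithMetaSketch {

          public static InternalRow withMeta(String recordKey, String partitionPath,
                                             InternalRow data) {
            InternalRow meta = new GenericInternalRow(new Object[] {
                UTF8String.fromString(recordKey),
                UTF8String.fromString(partitionPath)
            });
            // JoinedRow concatenates the two rows without copying field values.
            return new JoinedRow(meta, data);
          }
        }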

      These are nice-to-haves, but not on the critical path:

      • [P1] Make merge logic engine-agnostic (see the sketch after this list)
        • Today, each engine needs a different HoodieRecordMerger implementation class that implements the merging logic against the engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.). Providing a getField API on HoodieRecord could allow engine-agnostic merge logic.
      • [P1] HUDI-5249, HUDI-5282 Implement the MDT payload using the new merge API
        • Only necessary if we use parquet as both the base and the log file format in MDT
      • [P1] HUDI-3354 Existing engine-specific readers to use HoodieRecord
        • As we will implement new file-group readers and writers, we do not need to fix the existing readers now
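
      To make the engine-agnostic idea concrete, here is a sketch of latest-wins merging written once against a generic field accessor; the accessor parameter stands in for the proposed getField API and is an assumption:

        import java.util.function.BiFunction;
        import org.apache.hudi.common.model.HoodieRecord;

        public final class EngineAgnosticMergeSketch {

          // fieldAccessor is a stand-in for a HoodieRecord#getField API; with it,
          // simple ordering-field semantics need no InternalRow-, RowData-, or
          // Avro-specific merger classes.
          public static <T, C extends Comparable<C>> HoodieRecord<T> latestWins(
              HoodieRecord<T> older, HoodieRecord<T> newer, String orderingField,
              BiFunction<HoodieRecord<T>, String, C> fieldAccessor) {
            C oldVal = fieldAccessor.apply(older, orderingField);
            C newVal = fieldAccessor.apply(newer, orderingField);
            return newVal.compareTo(oldVal) >= 0 ? newer : older;
          }
        }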

      — OLD PLAN —

      Currently, Hudi is biased toward the assumption of a particular payload representation (Avro). Long-term, we would like to steer away from this and keep the record payload completely opaque, so that

      1. We can keep the record payload representation engine-specific
      2. We avoid unnecessary serde loops (engine-specific > Avro > engine-specific > binary)

      Proposal

       
      Phase 2: Revisiting Record Handling
      T-shirt: 2-2.5 weeks
      Goal: Avoid tight coupling with a particular record representation on the read path (currently Avro) and enable engine-specific record representations:
      • Revisit RecordPayload APIs
        • Deprecate the getInsertValue and combineAndGetUpdateValue APIs, replacing them with new “opaque” APIs (not returning Avro payloads)
        • Rebase the RecordPayload hierarchy to be engine-specific:
          • A common engine-specific base abstracting shared functionality (Spark, Flink, Java)
          • Each feature-specific semantic will have to be implemented for all engines
        • Introduce new APIs (see the sketch after this plan)
          • To access keys (record, partition)
          • To convert the record to Avro (for backwards compatibility)
      • Revisit RecordPayload handling
        • In WriteHandles
          • The API will accept an opaque RecordPayload (no Avro conversion)
          • Can do (opaque) record merging if necessary
          • Passes the RecordPayload as-is to the FileWriter
        • In FileWriters
          • Will accept the RecordPayload interface
          • Should be engine-specific (to handle the internal record representation)
        • In RecordReaders
          • The API will provide an opaque RecordPayload (no Avro conversion)
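
      A sketch of what the “opaque” payload API described above could look like; all names below are illustrative, not actual Hudi interfaces:

        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericRecord;

        // Record contents stay engine-specific; keys are exposed directly, and
        // Avro is reachable only through an explicit backwards-compatibility hook.
        public interface OpaqueRecordPayloadSketch<T> {

          // Key access without deserializing the payload to Avro.
          String getRecordKey();
          String getPartitionPath();

          // The engine-native representation (InternalRow, RowData, GenericRecord, ...).
          T getData();

          // Backwards-compatibility escape hatch; new code paths should avoid it.
          GenericRecord toAvro(Schema schema);
        }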

       

       

          People

            Assignee: guoyihua Ethan Guo (this is the old account; please use "yihua")
            Reporter: alexey.kudinkin Alexey Kudinkin