[HUDI-3081] Revisiting Read Path Infra across Query Engines - ASF JIRA

Details

Type: Epic
Status: Closed
Priority: Blocker
Resolution: Done
Affects Version/s: None
Fix Version/s: 0.11.0
Component/s: reader-core
Labels: None
Story Points: 0
Epic Name: Read-Path-Rewrite

Description

Currently, our read-path infrastructure is largely separate for each query engine, with the same flow replicated multiple times:

* Hive leverages a class hierarchy based on `InputFormat`
* Spark leverages a class hierarchy based on `SnapshotRelation`

This leads to substantial duplication: virtually the same flows are replicated multiple times, and they are now unfortunately diverging because their lifecycles (bug fixes, etc.) are out of sync.

Proposal

Phase 1: Abstracting Common Functionality

T-shirt: 1-1.5 weeks
Goal: Abstract the following common items to avoid duplicating complex sequences across engines
* Unify Hive's RecordReaders (`RealtimeCompactedRecordReader`, `RealtimeUnmergedRecordReader`)
  - These readers should differ only in the way they handle the payload; everything else should remain the same
* Abstract within a common component (name TBD)
  - Listing the current file slices at the requested instant (handling the timeline)
  - Creating a record iterator for the provided file slice
REF

https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-6680

Issue Links

blocks
* HUDI-2762 Ensure hive can query insert only logs in MOR (Reopened)

is blocked by
* HUDI-3279 Metadata table stores incorrect file sizes after Restore (Closed)

is duplicated by
* HUDI-3082 [Phase 1] Unify MOR table access across Spark, Hive (Closed)
* HUDI-2816 Unify file listing method of Spark/Flink/Hive (Closed)

split to
* HUDI-3247 Support incremental queries in AbstractHoodieTableFileIndex (Open)

People

Assignee: Alexey Kudinkin

Reporter: Alexey Kudinkin

Votes: 0

Watchers: 3

Dates

Created: 21/Dec/21 04:23

Updated: 09/May/22 09:35

Resolved: 26/Apr/22 00:22