[HUDI-8451] Followup to fix all callers to HoodieLogRecordReader to set the right value for max instant time - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.1
Component/s: reader-core
Labels: None

Description

As part of https://github.com/apache/hudi/pull/12033, we fixed an issue where the log record reader failed to read a data block in some edge cases.

The fix ensures the log record reader accounts for all rollback blocks, regardless of the max instant time configured for the read.

But let's also follow through and see if we can fix all callers to set the right value for the max instant time.

Say we have t1.dc and t2.dc, and t2.dc crashed midway.
The current layout is:
base file (t1), lf1 (partially committed data with t2 as instant time)

Then, say, we start t5.dc. Just as t5.dc starts, Hudi detects the pending commit and triggers a rollback. This rollback gets an instant time of t6 (t6.rb). Note that the rollback's commit time is greater than t5, the currently ongoing delta commit.
So, once the rollback completes, this is the layout:

base file, lf1 (from t2.dc, partially failed), lf3 (rollback command block with t6)

And once t5.dc completes, the layout looks like this:

base file, lf1 (from t2.dc, partially failed), lf3 (rollback command block with t6), lf4 (from t5)
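
The layout above can be walked through with a small toy model. This is only a sketch under stated assumptions: `LogBlockSketch` and `LogScanSketch` are hypothetical names, not Hudi classes, and the scan is reduced to two rules assumed for illustration: stop at the first block whose instant time is greater than the max instant time (the "break" discussed in this ticket), and skip command blocks and blocks whose instant is not completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a log block; not a Hudi class.
final class LogBlockSketch {
    final String file;             // log file holding the block, e.g. "lf1"
    final String instantTime;      // instant time stamped on the block
    final boolean rollbackCommand; // true for a rollback command block

    LogBlockSketch(String file, String instantTime, boolean rollbackCommand) {
        this.file = file;
        this.instantTime = instantTime;
        this.rollbackCommand = rollbackCommand;
    }
}

final class LogScanSketch {
    // Simplified scan (assumption): instant times compare lexicographically.
    static List<String> readableFiles(List<LogBlockSketch> blocks,
                                      Set<String> completedInstants,
                                      String maxInstantTime) {
        List<String> readable = new ArrayList<>();
        for (LogBlockSketch block : blocks) {
            if (block.instantTime.compareTo(maxInstantTime) > 0) {
                break; // every later block is ignored, including lf4
            }
            if (block.rollbackCommand || !completedInstants.contains(block.instantTime)) {
                continue; // command blocks and uncommitted data yield no records
            }
            readable.add(block.file);
        }
        return readable;
    }

    // The layout from this ticket, in log file order.
    static List<LogBlockSketch> ticketLayout() {
        return List.of(
            new LogBlockSketch("lf1", "t2", false), // partially failed t2.dc
            new LogBlockSketch("lf3", "t6", true),  // rollback command block
            new LogBlockSketch("lf4", "t5", false)  // data from t5.dc
        );
    }
}
```

In this model, with t1, t5 and t6 as completed instants, a max instant time of t5 stops the scan at lf3 and returns no log files, while a max instant time of t6 reaches lf4.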

Callers involved:

This affects global indexes (simple, bloom) by not applying deletes. For non-global indexes we read base files, and with only updates in the log, tagging for non-global (bloom/simple) is not affected.
Once there is a new commit, snapshot queries will start returning lf4 (almost eventually consistent behavior).
- Spark does not factor rollbacks (RBs) into latestInstantTime.
- Hive/Trino/Presto: if they all use the inputFormat, BaseHoodieFileIndex#getLatestCompletedInstant handles this.
- Flink: FormatUtils does not handle this.
CDC: also has issues, irrespective of whether the end instant time is set by the user or not.
Incremental queries: just fixing the last instant time alone may not suffice, since the instant time might be set by the user. So we might have to remove the "break" from within logRecordReader.
What about indexing (all new indexes added in 1.x)?
If clustering is scheduled right after this, or executed inline right after this ➝ this is not an issue, since clustering passes in its own instant time as latestInstantTime, which passes the check and exposes lf4.
If compaction is scheduled right after this, or executed inline right after this ➝ this is handled, since the lastInstantTime passed in accounts for the rollback and includes the rollback timestamp.
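
The difference between callers that factor rollbacks into the max instant time and callers that do not can be sketched as follows. This is a minimal illustration, not Hudi's API: `InstantSketch` and `MaxInstantTimeSketch` are hypothetical names, the action strings are used only as labels, and the timeline is assumed to hold t1.dc, t5.dc and t6.rb as completed instants (t2 is gone after the rollback).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a timeline entry; not a Hudi class.
final class InstantSketch {
    final String timestamp;
    final String action;     // label only, e.g. "deltacommit" or "rollback"
    final boolean completed;

    InstantSketch(String timestamp, String action, boolean completed) {
        this.timestamp = timestamp;
        this.action = action;
        this.completed = completed;
    }
}

final class MaxInstantTimeSketch {
    // Latest completed instant time, optionally counting rollback instants.
    static Optional<String> latestCompleted(List<InstantSketch> timeline,
                                            boolean includeRollbacks) {
        return timeline.stream()
            .filter(i -> i.completed)
            .filter(i -> includeRollbacks || !"rollback".equals(i.action))
            .map(i -> i.timestamp)
            .max(Comparator.naturalOrder()); // assumes lexicographic ordering
    }

    // Assumed timeline for the scenario in this ticket.
    static List<InstantSketch> ticketTimeline() {
        return List.of(
            new InstantSketch("t1", "deltacommit", true),
            new InstantSketch("t5", "deltacommit", true),
            new InstantSketch("t6", "rollback", true)
        );
    }
}
```

With this timeline, ignoring rollbacks yields t5 as the max instant time, which is lower than the t6 rollback block sitting before lf4 in the log, while including rollbacks yields t6.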

People

Assignee: Lin Liu

Reporter: sivabalan narayanan

Votes: 0

Watchers: 3

Dates

Created: 28/Oct/24 16:07

Updated: 1 week ago 20:35

Resolved: 1 week ago 20:35