Apache Hudi / HUDI-5707

Support offset reset strategy w/ spark streaming read from hudi table


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Component/s: reader-core

    Description

      For users reading a Hudi table in a streaming manner, we need to support an offset reset strategy for the case where the commit of interest is archived or cleaned up.

       

      Notes from the issue:

      In streaming read, the user might want to get all incremental changes. From what I see, this is nothing but an incremental query on a Hudi table. With incremental queries, we do have a fallback mechanism via hoodie.datasource.read.incr.fallback.fulltablescan.enable.

      But in streaming read, the amount of data read might spike up (if we do the same fallback), and the user may not have provisioned higher resources for the job.
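For reference, the existing batch incremental-query fallback looks roughly like this (a PySpark-style sketch; the begin instant time and table path are placeholders, not values from this issue):

```python
# Option keys for a Hudi incremental read via the Spark datasource; the
# fallback flag is the one mentioned above. The begin instant time is a
# placeholder.
incremental_read_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",  # placeholder
    "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
}

# With a live SparkSession this would be used as (sketch, not run here):
# df = spark.read.format("hudi").options(**incremental_read_opts).load("/path/to/table")
```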

      I am thinking whether we should add something like the auto.offset.reset we have in Kafka. If Spark streaming itself already has something similar, we can leverage the same; otherwise we can add a new config in Hoodie.

      So, users can configure what they want to do in such cases:

      1. Resume reading from the earliest valid commit in Hudi.
        // impl might be involved, since we need to detect the earliest commit which hasn't been cleaned up by the cleaner yet.
      2. Or do snapshot query w/ latest table state.
      3. Fail the streaming read.
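The three options above could be sketched as a small resolution policy, by analogy with Kafka's auto.offset.reset. The function name, strategy strings, and instant-time arguments below are hypothetical illustrations, not an existing Hudi API:

```python
# Hypothetical sketch of an auto.offset.reset-style policy for Hudi streaming
# reads. Strategy names mirror Kafka's earliest/latest/none semantics.
EARLIEST, LATEST, FAIL = "earliest", "latest", "fail"

def resolve_start_instant(requested, earliest_retained, latest, strategy):
    """Pick the instant to resume from when `requested` may have been
    archived or cleaned (i.e., it is older than `earliest_retained`).
    Instant times compare lexicographically, as Hudi timestamps do."""
    if requested >= earliest_retained:
        return requested          # still on the active timeline; read as asked
    if strategy == EARLIEST:
        return earliest_retained  # option 1: earliest valid commit
    if strategy == LATEST:
        return latest             # option 2: snapshot from latest table state
    # option 3: fail the streaming read
    raise ValueError(f"requested instant {requested} is no longer available")
```

For example, with a requested instant of "20230101", an earliest retained instant of "20230201", and the EARLIEST strategy, the read would resume from "20230201".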

            People

              Assignee: kazdy
              Reporter: sivabalan narayanan (shivnarayan)
              Votes: 0
              Watchers: 2
