Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
Description
For users reading a Hudi table in a streaming manner, we need to support an offset reset strategy for the case where the commit of interest is archived or cleaned up.
Notes from the issue:
In streaming read, a user might want to get all incremental changes. From what I see, this is nothing but an incremental query on a Hudi table. With incremental queries, we do have a fallback mechanism via hoodie.datasource.read.incr.fallback.fulltablescan.enable.
But in streaming read, the amount of data read might spike up if we do the same, and the user may not have provisioned higher resources for the job.
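For reference, the batch-side fallback mentioned above is configured on an incremental query roughly like this (the begin instant time below is a placeholder, not a value from this issue):

```properties
hoodie.datasource.query.type=incremental
# Placeholder begin instant; in practice this is the last consumed commit time
hoodie.datasource.read.begin.instanttime=20230101000000
# Fall back to a full table scan if the requested commits are archived/cleaned
hoodie.datasource.read.incr.fallback.fulltablescan.enable=true
```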
I am thinking we should add something like the auto.offset.reset config we have in Kafka. If Spark itself already has something similar for streaming reads, we can leverage that; otherwise we can add a new config in Hudi.
So, users can configure what they want to do in such cases:
- Resume reading from the earliest valid commit in Hudi. The implementation might be involved, since we need to detect the earliest commit that has not yet been removed by the cleaner.
- Do a snapshot query with the latest table state.
- Fail the streaming read.
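The three options above could be modeled as an enum-valued config, analogous to Kafka's auto.offset.reset. Below is a minimal sketch of the resolution logic; every name here (the enum, its values, and the helper function) is hypothetical and not an actual Hudi API:

```python
from enum import Enum

class OffsetResetStrategy(Enum):
    # Hypothetical values for a would-be hoodie streaming offset-reset config
    EARLIEST = "earliest"         # resume from earliest commit not yet cleaned
    LATEST_SNAPSHOT = "snapshot"  # snapshot query with latest table state
    FAIL = "fail"                 # fail the streaming read

def resolve_start_commit(requested, active_commits, strategy):
    """Pick a start commit when `requested` may have been archived or cleaned.

    `active_commits` is the sorted list of commit times still on the
    active timeline (i.e., not yet archived or cleaned up).
    """
    if requested in active_commits:
        return requested  # commit still available, no reset needed
    if strategy is OffsetResetStrategy.EARLIEST:
        return active_commits[0]
    if strategy is OffsetResetStrategy.LATEST_SNAPSHOT:
        return active_commits[-1]
    raise ValueError(f"commit {requested} was archived/cleaned; failing read")
```

This mirrors Kafka's behavior: "earliest" trades a possible read spike for completeness, "snapshot" trades completeness for bounded resource use, and "fail" surfaces the gap to the operator.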