Details
- Type: Epic
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- None
Implement DeltaStreamer Source for cloud object stores
Description
As discussed in HUDI-1723, we need a better implementation for cloud object storage such as AWS S3 or GCS, one that leverages change notifications.
Also consider https://docs.databricks.com/spark/latest/structured-streaming/sqs.html
We need to look into the current *DFSSource classes and see whether we can add a new `DFSPathSelector` implementation that fetches new files on cloud storage after a given point in time. The timestamp-based approach used by the existing path selector largely works, but it has corner cases, as mentioned in HUDI-1723.
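To make the corner case concrete, here is a minimal sketch in Python of a simplified model of timestamp-based selection. It is not the actual `DFSPathSelector` code. `FileInfo`, `select_by_timestamp`, `select_by_timestamp_and_path`, and the object paths are hypothetical names introduced for illustration, and the composite (timestamp, path) checkpoint is only one assumed way to avoid the skip, not something this ticket prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FileInfo:
    path: str      # hypothetical object path
    mod_time: int  # modification time, epoch millis
    size: int      # size in bytes


def _take_up_to_limit(candidates: List[FileInfo], source_limit: int) -> List[FileInfo]:
    """Take files in order until the next one would exceed the limit.

    At least one file is always taken so that ingestion makes progress.
    """
    selected: List[FileInfo] = []
    total = 0
    for f in candidates:
        if selected and total + f.size > source_limit:
            break
        selected.append(f)
        total += f.size
    return selected


def select_by_timestamp(files: Iterable[FileInfo], checkpoint: int, source_limit: int):
    """Timestamp-only checkpoint: files with mod_time > checkpoint are eligible."""
    candidates = sorted((f for f in files if f.mod_time > checkpoint),
                        key=lambda f: f.mod_time)
    selected = _take_up_to_limit(candidates, source_limit)
    # The checkpoint only records a time, so unread files sharing that
    # modification time are no longer eligible in the next round.
    new_checkpoint = selected[-1].mod_time if selected else checkpoint
    return selected, new_checkpoint


def select_by_timestamp_and_path(files: Iterable[FileInfo],
                                 checkpoint: Tuple[int, str],
                                 source_limit: int):
    """Assumed alternative: checkpoint on (mod_time, path) to break ties."""
    candidates = sorted((f for f in files if (f.mod_time, f.path) > checkpoint),
                        key=lambda f: (f.mod_time, f.path))
    selected = _take_up_to_limit(candidates, source_limit)
    new_checkpoint = ((selected[-1].mod_time, selected[-1].path)
                      if selected else checkpoint)
    return selected, new_checkpoint
```

With two files that share a modification time and a source limit that admits only one file per round, the timestamp-only variant never returns the second file, while the composite checkpoint picks it up in the next round.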
Issue Links
- Blocked
  - HUDI-4155 Support optional Source schema config for S3EventsHoodieIncrSource (Open)
- is a parent of
  - HUDI-4928 Use common configs for ingestion from S3, GCS etc (Open)
  - HUDI-4929 Refactor code that is common to all ingestions from cloud sources (Open)
- is related to
  - HUDI-1723 DFSPathSelector skips files with the same modify date when read up to source limit (Resolved)
- links to