[SPARK-47717] Support Hive tables as a streaming source and sink

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.2, 3.4.1, 3.5.1
Fix Version/s: 3.3.2, 3.4.1, 3.5.1
Component/s: SQL
Labels: None
Target Version/s: 3.3.2, 3.4.1, 3.5.1

Description

Many users store data in Hive tables. These tables cannot currently be used as a Spark streaming source or sink, so there is no native way to stream this data in Spark. Existing workarounds rely on an intermediary that tracks which data has already been read and periodically runs batch jobs. Spark's built-in streaming mechanism should support this use case directly.
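
For illustration, the intermediary-based workaround described above might look roughly like the following. This is a minimal sketch in plain Python, not code from Spark or Hive: `PartitionTracker`, `run_once`, the JSON state file, and the `dt=...` partition names are all hypothetical, and the metastore listing and the Spark batch job are replaced by simple callables.

```python
import json
import tempfile
from pathlib import Path


class PartitionTracker:
    """Remembers which table partitions earlier batch runs already processed."""

    def __init__(self, state_file):
        self.state_file = Path(state_file)

    def load_processed(self):
        # No state file yet means nothing has been processed.
        if not self.state_file.exists():
            return set()
        return set(json.loads(self.state_file.read_text()))

    def mark_processed(self, partitions):
        done = self.load_processed() | set(partitions)
        self.state_file.write_text(json.dumps(sorted(done)))


def run_once(tracker, list_partitions, process_batch):
    """One scheduled run: process only partitions not seen before."""
    new = sorted(set(list_partitions()) - tracker.load_processed())
    if new:
        process_batch(new)           # stand-in for a Spark batch job
        tracker.mark_processed(new)  # record progress only after the batch succeeds
    return new


# Demo with in-memory stand-ins for the partition listing and the batch job.
state_dir = tempfile.mkdtemp()
tracker = PartitionTracker(Path(state_dir) / "processed.json")
seen = []
first = run_once(tracker, lambda: ["dt=2024-04-01", "dt=2024-04-02"], seen.extend)
second = run_once(
    tracker,
    lambda: ["dt=2024-04-01", "dt=2024-04-02", "dt=2024-04-03"],
    seen.extend,
)
```

The second run picks up only the partition that appeared since the first run. The external state file and the scheduler that calls `run_once` are exactly the extra moving parts that native streaming support would make unnecessary.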

From some research, Hive itself supports streaming ingest (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2), but Spark does not support streaming on tables in Hive format. I don't think it makes sense to copy Hive server-side code wholesale into Spark, but we could port just the relevant logic and wrap it in the DataSourceV2 APIs to enable this feature. To avoid breaking backwards compatibility, we would probably want to gate this behind a new Spark property.
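
As a rough illustration of the shape this could take, here is a toy, plain-Python model of an offset-based micro-batch source that is only usable when an opt-in flag is set. Everything in it is an assumption made for the sketch: the property name `spark.sql.hive.streaming.enabled` is hypothetical (this issue does not propose one), the table is modeled as a list of commits that each add files, and the method names only loosely echo `initialOffset`, `latestOffset`, and `planInputPartitions` from Spark's `MicroBatchStream` interface. It is not Spark or Hive code.

```python
from dataclasses import dataclass

# Hypothetical property name; the issue does not propose one.
HIVE_STREAMING_FLAG = "spark.sql.hive.streaming.enabled"


@dataclass(frozen=True)
class Offset:
    """Position in the table's commit list: number of commits already consumed."""
    index: int


class HiveTableMicroBatchSource:
    """Toy model of an offset-based micro-batch source over a list of commits."""

    def __init__(self, conf, commits):
        # Gate: refuse to stream unless the opt-in flag is set, so default
        # behavior stays unchanged.
        if conf.get(HIVE_STREAMING_FLAG, "false").lower() != "true":
            raise ValueError(
                f"Hive streaming is disabled; set {HIVE_STREAMING_FLAG}=true"
            )
        self._commits = commits  # each commit is the list of files it added

    def initial_offset(self):
        return Offset(0)

    def latest_offset(self):
        return Offset(len(self._commits))

    def plan_input(self, start, end):
        """Files added by commits in the half-open range [start, end)."""
        return [f for commit in self._commits[start.index:end.index] for f in commit]


# Demo: two micro-batches, with a new commit arriving in between.
commits = [["part-0"], ["part-1", "part-2"]]
source = HiveTableMicroBatchSource({HIVE_STREAMING_FLAG: "true"}, commits)
end = source.latest_offset()
batch1 = source.plan_input(source.initial_offset(), end)
commits.append(["part-3"])
batch2 = source.plan_input(end, source.latest_offset())
```

Because progress is expressed as an offset range, the engine's own checkpointing could take over the bookkeeping that the external intermediary does today, and the flag keeps the new code path strictly opt-in.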

People

Assignee: Unassigned
Reporter: Adi Suresh
Shepherd: Adi Suresh
Votes: 0
Watchers: 1

Dates

Created: 03/Apr/24 15:08
Updated: 03/Apr/24 15:08