[SPARK-18791] Stream-Stream Joins - ASF JIRA

Rank to Top

Rank to Bottom

Attach files

Attach Screenshot

Bulk Copy Attachments

Bulk Move Attachments

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Labels

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: Structured Streaming
Labels:
None

Description

Stream stream join is a much requested, but missing feature in Structured Streaming. While the join API exists in Datasets and DataFrames, it throws UnsupportedOperationException when applied between two streaming Datasets/DataFrames. To support this, we have to maintain the same semantics as other Structured Streaming operations - the result of the operation after consuming two data streams data till positions/offsets X and Y, respectively, must be the same as a single batch join operation on all the data till positions X and Y, respectively. To achieve this, the execution has to buffer past data (i.e. streaming state) from each stream, so that future data can be matched against past data. Here is the set of a few high-level requirements.

Buffer past rows as streaming state (using StateStore), and joining with the past rows.
Support state cleanup using the event time watermark when possible.
Support different types of joins (inner, left outer, right outer is in highest demand for ETL/enrichment type use cases [kafka -> best-effort enrich -> write to S3])
Support cascading join operations (i.e. joining more than 2 streams)
Support multiple output modes (Append mode is in highest demand for enabling ETL/enrichment type use cases)

All the work to incrementally build this is going represented by this JIRA, with specific subtasks for each step. At this point, this is the rough direction as follows:

Implement stream-stream inner join in Append Mode, supporting multiple cascaded joins.
Extends it stream-stream left/right outer join in Append Mode

Attachments

Sub-Tasks

Create Sub-Task

1.	Implement stream-stream inner join in Append mode	Resolved	Tathagata Das	Actions
2.	Implement stream-stream outer joins in append mode	Resolved	Jose Torres	Actions
3.	Add documentation for stream-stream joins	Resolved	Tathagata Das	Actions

Activity

Comment

This comment will be Viewable by All Users Viewable by All Users

Cancel

People

Assignee:: Tathagata Das

Reporter:: Michael Armbrust

Votes:: 24 Vote for this issue

Watchers:: 26 Start watching this issue

Dates

Created:: 08/Dec/16 21:53

Updated:: 02/May/18 21:00

Resolved:: 02/May/18 21:00

Agile

View on Board

Stream-Stream Joins

Details

Description

Attachments

Attachments

Sub-Tasks

Activity

People

Dates

Agile

Slack

Issue deployment