Stream stream join is a much requested, but missing feature in Structured Streaming. While the join API exists in Datasets and DataFrames, it throws UnsupportedOperationException when applied between two streaming Datasets/DataFrames. To support this, we have to maintain the same semantics as other Structured Streaming operations - the result of the operation after consuming two data streams data till positions/offsets X and Y, respectively, must be the same as a single batch join operation on all the data till positions X and Y, respectively. To achieve this, the execution has to buffer past data (i.e. streaming state) from each stream, so that future data can be matched against past data. Here is the set of a few high-level requirements.
- Buffer past rows as streaming state (using StateStore), and joining with the past rows.
- Support state cleanup using the event time watermark when possible.
- Support different types of joins (inner, left outer, right outer is in highest demand for ETL/enrichment type use cases [kafka -> best-effort enrich -> write to S3])
- Support cascading join operations (i.e. joining more than 2 streams)
- Support multiple output modes (Append mode is in highest demand for enabling ETL/enrichment type use cases)
All the work to incrementally build this is going represented by this JIRA, with specific subtasks for each step. At this point, this is the rough direction as follows:
- Implement stream-stream inner join in Append Mode, supporting multiple cascaded joins.
- Extends it stream-stream left/right outer join in Append Mode
|Implement stream-stream inner join in Append mode||Resolved|
|Implement stream-stream outer joins in append mode||Resolved|
|Add documentation for stream-stream joins||Resolved|