Details
Type: New Feature
Status: Triage Needed
Priority: P2
Resolution: Fixed
None
Description
Why is it worth creating a new runner based on structured streaming?
Because this new framework brings:
- Unified batch and streaming semantics:
  - No more RDD/DStream distinction, just as Beam has only PCollection
- Better state management:
  - Incremental state instead of saving the full state each time
  - No more synchronous saves delaying computation: a delta file is saved asynchronously per batch and partition, while put/get go synchronously to an in-memory hashmap
- Schemas in datasets:
  - The dataset knows the structure of the data (fields) and can optimize later on
  - Matches schemas in PCollection in Beam
- New Source API:
  - Very close to Beam bounded and unbounded sources
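The state management point above can be illustrated with a small, self-contained sketch. This is not Spark's state store implementation: the class `DeltaStateStore`, the helper `restore`, the `.delta` file naming, and the JSON encoding are all hypothetical, and the sketch handles a single partition only. It just shows the idea of serving put/get synchronously from an in-memory hashmap while writing only the per-batch changes to a delta file in the background.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class DeltaStateStore:
    """Toy state store (hypothetical): in-memory map plus async per-batch delta files."""

    def __init__(self, directory):
        self.directory = directory
        self.state = {}    # in-memory hashmap, read and written synchronously
        self.pending = {}  # keys changed since the last commit
        # A single worker keeps delta files written in commit order.
        self.writer = ThreadPoolExecutor(max_workers=1)
        self.futures = []

    def put(self, key, value):
        self.state[key] = value
        self.pending[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    def commit(self, batch_id):
        """Hand the changes of this batch to the background writer and return at once."""
        delta, self.pending = self.pending, {}
        path = os.path.join(self.directory, f"{batch_id}.delta")
        self.futures.append(self.writer.submit(self._write, path, delta))
        return path

    @staticmethod
    def _write(path, delta):
        with open(path, "w") as f:
            json.dump(delta, f)

    def close(self):
        """Wait for all pending delta writes, then stop the writer."""
        for future in self.futures:
            future.result()
        self.writer.shutdown()


def restore(directory):
    """Rebuild the full state by replaying delta files in batch order."""
    state = {}
    names = sorted(os.listdir(directory), key=lambda n: int(n.split(".")[0]))
    for name in names:
        with open(os.path.join(directory, name)) as f:
            state.update(json.load(f))
    return state


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = DeltaStateStore(directory)
        store.put("a", 1)
        store.put("b", 2)
        store.commit(0)      # batch 0 delta holds both keys
        store.put("a", 5)
        store.commit(1)      # batch 1 delta holds only the changed key
        store.close()
        with open(os.path.join(directory, "1.delta")) as f:
            second_delta = json.load(f)
        return second_delta, restore(directory)
```

In this sketch the second delta file contains only the key that changed in that batch, and replaying the deltas in order recovers the full state, which is the incremental behaviour the bullet describes.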
Why make a new runner from scratch?
- The structured streaming framework is very different from the RDD/DStream framework
We hope to gain:
- A more up-to-date runner in terms of libraries, able to leverage new features
- The practices learnt from the previous runners
- Better performance thanks to the DAG optimizer (Catalyst) and simpler code
- Simpler code and easier maintenance