Details

Type: New Feature
Status: Triage Needed
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.18.0
Component/s: runner-spark
Labels:
- structured-streaming

Description

Why is it worth creating a new runner based on Structured Streaming?

Because this new framework brings:

Unified batch and streaming semantics:
No more RDD/DStream distinction, just as Beam has only PCollection

Better state management:
Incremental state instead of saving the whole state each time
No more synchronous saving that delays computation: a delta file is saved asynchronously per batch and partition, while puts and gets go synchronously to an in-memory hash map

Schemas in datasets:
A Dataset knows the structure of its data (fields) and can use it for later optimization
Beam likewise has schemas in PCollection

New Source API:
Very close to Beam's bounded and unbounded sources
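
The state management point above can be illustrated with a small sketch. This is not Spark's state store code; it is a toy model, using only the Python standard library, of the idea described: reads and writes hit an in-memory map synchronously, and each batch hands only its changed keys to a background writer as a delta file. The class name DeltaStateStore, the restore helper and the file naming are hypothetical, and deletions are not handled.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class DeltaStateStore:
    """Toy state store: synchronous in-memory access, asynchronous delta files."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.state = {}    # full state, always served from memory
        self.pending = {}  # keys changed since the last commit
        self.writer = ThreadPoolExecutor(max_workers=1)

    def put(self, key, value):
        self.state[key] = value
        self.pending[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    def commit(self, batch_id):
        """Queue only the changed keys for a background write and return at once."""
        delta, self.pending = self.pending, {}
        path = self.directory / f"{batch_id}.delta.json"
        return self.writer.submit(path.write_text, json.dumps(delta))

    def close(self):
        # Block until every queued delta file has been written.
        self.writer.shutdown(wait=True)


def restore(directory):
    """Rebuild the full state by replaying delta files in batch order."""
    state = {}
    paths = sorted(Path(directory).glob("*.delta.json"),
                   key=lambda p: int(p.name.split(".")[0]))
    for path in paths:
        state.update(json.loads(path.read_text()))
    return state


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = DeltaStateStore(tmp)
        store.put("a", 1)
        store.put("b", 2)
        store.commit(0)
        store.put("a", 5)  # only "a" changes in batch 1
        store.commit(1)
        store.close()
        second_delta = json.loads((Path(tmp) / "1.delta.json").read_text())
        return second_delta, restore(tmp)
```

In this sketch the second delta file contains only the key that changed in that batch, and replaying the deltas in order recovers the full state, which is the contrast with saving the whole state on every batch.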

Why make a new runner from scratch?

The Structured Streaming framework is very different from the RDD/DStream framework.

We hope to gain:

A more up-to-date runner in terms of libraries, able to leverage new features
The benefit of practices learned from the previous runners
Better performance thanks to the DAG optimizer (Catalyst) and simpler code
Simpler code that is easier to maintain

Issue Links

is duplicated by

BEAM-198 Spark runner batch translator to work with Datasets instead of RDDs

Resolved

supersedes

BEAM-913 Create the skeleton for a Dataset API Spark runner

Resolved

links to

GitHub Pull Request #9866

GitHub Pull Request #10170

GitHub Pull Request #10171

GitHub Pull Request #10211

GitHub Pull Request #10221

People

Assignee: Etienne Chauchot

Reporter: Etienne Chauchot

Votes: 0

Watchers: 3

Dates

Created: 24/Oct/19 13:26

Updated: 13/Apr/23 10:58

Resolved: 05/Dec/19 15:09

Time Tracking

Estimated: Not Specified

Remaining: 0h

Logged: 16h