[SPARK-20928] SPIP: Continuous Processing Mode for Structured Streaming - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: Structured Streaming
Labels:
- SPIP
- bulk-closed

Description

Given the current Source API, the minimum possible latency for any record is bounded by the time it takes to launch a task. This limitation arises because getBatch requires us to know both the starting and the ending offset before any tasks are launched. In the worst case, the end-to-end latency is closer to the average batch time plus the task launch time.
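The bound described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the object name `MicroBatchLatency`, the function `latencyMs`, and the numbers used are hypothetical and are not measurements or Spark APIs. It assumes a record that misses the current batch's end offset must wait for that batch to finish and then for the next batch's tasks to launch.

```scala
// Toy latency model for micro-batch execution.
// All names and numbers here are illustrative assumptions, not Spark APIs or measurements.
object MicroBatchLatency {

  /**
   * Latency in ms of a record that arrives `arrivalMs` after the current
   * batch's end offset was fixed (so it cannot be part of that batch).
   */
  def latencyMs(arrivalMs: Double, batchMs: Double, taskLaunchMs: Double): Double = {
    require(arrivalMs >= 0 && arrivalMs <= batchMs, "arrival must fall within the batch")
    // Wait for the remainder of the running batch, then for the next batch's tasks to launch.
    (batchMs - arrivalMs) + taskLaunchMs
  }

  def main(args: Array[String]): Unit = {
    val batchMs = 50.0      // assumed average batch time
    val taskLaunchMs = 10.0 // assumed task launch time
    println(latencyMs(0.0, batchMs, taskLaunchMs))     // worst case: batch time + launch time
    println(latencyMs(batchMs, batchMs, taskLaunchMs)) // best case: launch time only
  }
}
```

Even in the best case the result never drops below the task launch time, which is the floor the description refers to.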

For applications where latency is more important than exactly-once output, however, it would be useful if processing could happen continuously. This would allow us to achieve fully pipelined reading and writing from sources such as Kafka. Such an architecture would make it possible to process records with end-to-end latencies on the order of 1 ms, rather than the 10-100 ms that is possible today.

One possible architecture here would be to change the Source API to look like the following rough sketch:

  trait Epoch {
    def data: DataFrame

    /** The exclusive starting position for `data`. */
    def startOffset: Offset

    /**
     * The inclusive ending position for `data`. Incrementally updated during processing,
     * but not complete until execution of the query plan in `data` is finished.
     */
    def endOffset: Offset
  }

  def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], limits: Limits): Epoch

The above would allow us to build an alternative implementation of StreamExecution that processes continuously with much lower latency and only stops processing when it needs to reconfigure the stream (due to either a failure or a user-requested change in parallelism).
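To make the epoch idea concrete, here is a minimal, Spark-free sketch of how an incrementally updated end offset could behave. It is an illustration under loud assumptions, not the proposed implementation: `data` is an `Iterator[String]` rather than a `DataFrame`, `Offset` is a plain wrapper around a `Long`, the `limits` parameter is omitted, and `InMemorySource` is a hypothetical helper invented for this example.

```scala
// Simplified, Spark-free illustration of the Epoch sketch.
// Assumptions: Iterator[String] stands in for DataFrame, Offset wraps a Long,
// `limits` is omitted, and InMemorySource is a hypothetical test source.
object EpochSketch {
  final case class Offset(value: Long)

  trait Epoch {
    def data: Iterator[String]
    /** The exclusive starting position for `data`. */
    def startOffset: Offset
    /** The inclusive ending position for `data`; advances as `data` is consumed. */
    def endOffset: Offset
  }

  /** Hypothetical source in which the record at offset k is `records(k - 1)`. */
  final class InMemorySource(records: IndexedSeq[String]) {
    def getBatch(start: Option[Offset], end: Option[Offset]): Epoch = new Epoch {
      private val firstOffset: Long = start.map(_.value).getOrElse(0L)
      // With no end offset the epoch is open-ended, bounded only by the available data.
      private val lastOffset: Long =
        math.min(end.map(_.value).getOrElse(Long.MaxValue), records.size.toLong)
      private var consumed: Long = firstOffset

      val startOffset: Offset = Offset(firstOffset)
      def endOffset: Offset = Offset(consumed)

      val data: Iterator[String] = new Iterator[String] {
        def hasNext: Boolean = consumed < lastOffset
        def next(): String = {
          if (!hasNext) throw new NoSuchElementException("epoch exhausted")
          consumed += 1 // reading a record moves the inclusive end offset forward
          records((consumed - 1).toInt)
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val epoch = new InMemorySource(Vector("a", "b", "c")).getBatch(None, None)
    epoch.data.next()
    epoch.data.next()
    println(epoch.endOffset) // prints "Offset(2)"
  }
}
```

The point of the sketch is that the caller does not need to know the ending offset before reading starts; it can be observed while the epoch is being processed and is final only once `data` is exhausted.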

Attachments

Continuous Processing in Structured Streaming Design Sketch.pdf
23/Oct/17 22:07
110 kB
Reynold Xin

Issue Links

is related to

SPARK-24374 SPIP: Support Barrier Execution Mode in Apache Spark

Resolved

Sub-Tasks

#	Summary	Status	Assignee
1.	Add DataSourceV2 streaming APIs	Resolved	Jose Torres
2.	refactor StreamExecution for extensibility	Resolved	Jose Torres
3.	Add ContinuousExecution for continuous processing queries	Resolved	Jose Torres
4.	Make MicroBatchExecution also support MicroBatchRead/WriteSupport	Resolved	Jose Torres
5.	add basic continuous kafka source	Resolved	Jose Torres
6.	Move Structured Streaming v2 APIs to streaming package	Resolved	Shixiong Zhu
7.	disable task-level retry for continuous execution	Resolved	Jose Torres
8.	don't modify run id	Resolved	Jose Torres
9.	continuous execution should sequence committed epochs	Resolved	Efim Poberezkin
10.	Add MemoryStream	Resolved	Jose Torres
11.	Refactor tests away from rate source	Resolved	Jungtaek Lim
12.	Add EpochCoordinator unit tests	Resolved	Jose Torres
13.	Support select from temp tables	Resolved	Saisai Shao
14.	update query.status	Resolved	Gabor Somogyi
15.	update query progress	Resolved	Unassigned
16.	stateful operators	Resolved	Unassigned
17.	Control maximum epoch backlog	Resolved	Gabor Somogyi
18.	add unit tests for ContinuousDataReader hook	Resolved	Unassigned
19.	watermarks	Resolved	Unassigned
20.	exactly-once mode	Resolved	Unassigned

People

Assignee: Jose Torres

Reporter: Michael Armbrust

Votes: 24

Watchers: 119

Dates

Created: 30/May/17 21:39

Updated: 25/May/21 01:43

Resolved: 25/May/21 01:43
