[SPARK-24156] Enable no-data micro batches for more eager streaming state clean up - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.4.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.4.0

Description

Currently, MicroBatchExecution in Structured Streaming runs batches only when there is new data to process. This is sensible in most cases, as we don't want to use resources unnecessarily when there is nothing new to process. However, for some stateful streaming queries, this delays state cleanup as well as cleanup-based output. For example, consider a streaming aggregation query with watermark-based state cleanup. The watermark is updated after every batch with new data completes. The updated value is used in the next batch to clean up state and, in append mode, to output finalized aggregates. However, if there is no new data, the next batch does not occur, and cleanup and output are delayed unnecessarily. This is true for all stateful streaming operators: aggregation, deduplication, joins, and mapGroupsWithState.
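To make the delay concrete, here is a minimal sketch in plain Python of a windowed count in append mode. It is a toy model, not Spark's implementation: `run_query`, its parameters, and the rule "run an extra batch when the watermark has moved since the last batch" are all assumptions made for illustration.

```python
from collections import defaultdict


def run_query(triggers, window=10, delay=5, no_data_batches=False):
    """Toy model: `triggers` is a list of event-time lists, one per trigger."""
    state = defaultdict(int)      # window start -> running count
    emitted = []                  # finalized (window start, count) rows
    watermark = 0                 # updated at the end of each batch
    last_used_watermark = 0       # watermark the most recent batch ran with
    max_event_time = 0

    def run_batch(events):
        nonlocal watermark, last_used_watermark, max_event_time
        last_used_watermark = watermark
        for t in events:
            start = (t // window) * window
            if start + window > watermark:   # drop data for closed windows
                state[start] += 1
            max_event_time = max(max_event_time, t)
        # Emit and evict every window that the current watermark has passed.
        for start in sorted(state):
            if start + window <= watermark:
                emitted.append((start, state.pop(start)))
        # The new watermark only takes effect in the next batch.
        watermark = max(watermark, max_event_time - delay)

    for events in triggers:
        if events:
            run_batch(events)
        elif no_data_batches and watermark != last_used_watermark:
            # Assumed rule: no new data, but the watermark moved, so run
            # one extra batch purely for cleanup and finalized output.
            run_batch([])
    return emitted


triggers = [[1, 4, 12, 21], [], []]
print(run_query(triggers, no_data_batches=False))  # → []
print(run_query(triggers, no_data_batches=True))   # → [(0, 2)]
```

In the first call the watermark advances to 16 after the only batch, but window [0, 10) stays in state because no further batch runs. In the second call, one no-data batch emits and evicts it.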

This issue tracks the work to enable no-data batches in MicroBatchExecution. The major challenge is that all the tests of the relevant stateful operations add dummy data to force another batch for testing state cleanup, so many of those tests will have to change. My plan is therefore to enable no-data batches for the different stateful operators one at a time.

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee: Tathagata Das

Reporter: Tathagata Das

Votes: 0

Watchers: 10

Dates

Created: 02/May/18 21:31

Updated: 03/Feb/22 14:53

Resolved: 10/Sep/18 13:55