SPARK-24565

Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: Structured Streaming
    • Labels: None
    • Target Version/s:

      Description

      Currently, the micro-batches in the MicroBatchExecution are not exposed to the user through any public API. This was deliberate: we did not want to expose micro-batch-specific concepts, so that every API we expose could eventually be supported by the Continuous engine as well. But now that we have a better sense of how to build a ContinuousExecution, I am considering adding APIs that will run only with the MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.

      • Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() the data while learning).
      • Reuse batch data sources for output whose streaming version does not exist (e.g., the Redshift data source).
      • Write the output rows to multiple places by writing twice for each batch. This is not the most elegant thing to do for multi-output streaming queries, but it is likely to be better than running two streaming queries that process the same data twice.

      The proposal is to add a method foreachBatch(f: Dataset[T] => Unit) to the Scala/Java/Python DataStreamWriter.
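
      To make the proposal concrete, below is a minimal Scala sketch of how such a method could be used. It is written against the two-argument form foreachBatch((Dataset[T], Long) => Unit) that ultimately shipped in 2.4.0 (the batch id lets a writer deduplicate output if a batch is retried); the rate source and the output paths are illustrative placeholders, not part of the proposal.

      {code:scala}
      import org.apache.spark.sql.{DataFrame, SparkSession}

      object ForeachBatchSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("foreachBatch-sketch")
            .master("local[*]")
            .getOrCreate()

          // Placeholder source: the built-in "rate" source emits
          // (timestamp, value) rows, which is convenient for testing.
          val stream: DataFrame = spark.readStream
            .format("rate")
            .option("rowsPerSecond", 10)
            .load()

          // Defining the function as a typed val sidesteps the Scala 2.12
          // overload ambiguity between the Scala and Java variants.
          val writeTwice: (DataFrame, Long) => Unit = (batchDF, batchId) => {
            // The batch output is an ordinary DataFrame, so the full batch
            // writer API is available, including sinks that have no
            // streaming version (use case 2 above).
            batchDF.persist()
            // Write the same rows to two destinations (use case 3 above).
            // Both paths are hypothetical.
            batchDF.write.mode("append").parquet("/tmp/sketch/parquet")
            batchDF.write.mode("append").json("/tmp/sketch/json")
            batchDF.unpersist()
            ()
          }

          val query = stream.writeStream
            .foreachBatch(writeTwice)
            .start()

          query.awaitTermination()
        }
      }
      {code}

      Since the function receives a plain Dataset, calling collect() inside it also covers the first use case (feeding batch-only ML libraries), at the usual cost of materializing the batch on the driver.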



            People

            • Assignee: Tathagata Das (tdas)
            • Reporter: Tathagata Das (tdas)
            • Votes: 0
            • Watchers: 3
