[SPARK-37062] Introduce a new data source for providing consistent set of rows per microbatch - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.3.0
Component/s: Structured Streaming
Labels:
- releasenotes

Description

The "rate" data source is commonly used to benchmark streaming queries.

While this helps push a query to its limit (how many rows the query can process per second), the rate data source does not provide a consistent number of rows per batch, which makes two environments hard to compare.

For example, in many cases you may want to compare per-batch metrics between test environments (such as running the same streaming query with different options). These metrics are strongly affected when the distribution of input rows across batches changes, especially when a micro-batch lags (for any reason) and the rate data source produces more input rows for the next batch.
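
The effect of a lagging batch can be illustrated with a small, simplified model. This is only a sketch, not the rate source's actual implementation: it assumes rows accrue at a fixed rate with wall-clock time and that each batch picks up everything accrued since the previous batch started. The function name and its parameters are hypothetical.

```python
def rows_per_batch(rows_per_second, batch_start_times_s):
    """Rows each batch receives when input accrues with wall-clock time.

    Simplified model (an assumption, not Spark code): a batch reads all
    rows generated between the previous batch's start and its own start.
    """
    counts = []
    previous = 0
    for start in batch_start_times_s:
        counts.append(rows_per_second * (start - previous))
        previous = start
    return counts


# Batches start every second: every batch sees the same number of rows.
steady = rows_per_batch(100, [1, 2, 3, 4])

# The third batch starts 2 seconds late and absorbs the backlog.
lagged = rows_per_batch(100, [1, 2, 5, 6])
```

Under this model the steady run sees 100 rows in every batch, while the lagged run sees 300 rows in its third batch, so per-batch metrics from the two runs are no longer comparable.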

Also, when you test streaming aggregation, you may want the data source to produce the same set of input rows per batch (deterministic), so that you can plan how these input rows will be aggregated and how state rows will be evicted, and craft the test query based on that plan.

The requirements for the new data source are as follows:

- it should produce a specific number of input rows per batch, as requested
- it should also include a timestamp (event time) in each row
  - to make the input rows fully deterministic, the timestamp should be configurable as well (e.g. start timestamp and amount of advance per batch)
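
The requirements above can be sketched as a pure function of the batch number. This is a minimal illustration, not the code from the linked pull request. The Structured Streaming guide for Spark 3.3.0 documents a `rate-micro-batch` source with `rowsPerBatch`, `startTimestamp`, and `advanceMillisPerBatch` options, and the parameter names below loosely mirror those, but the function itself is hypothetical. It also assumes, for simplicity, that all rows in a batch share one event timestamp.

```python
def micro_batch_rows(batch_id, rows_per_batch, start_timestamp_ms, advance_ms_per_batch):
    """Return the (timestamp_ms, value) rows for one micro-batch.

    The output depends only on the arguments, never on wall-clock time,
    so the same batch_id always yields the same rows.
    Assumption: every row in a batch carries the same event timestamp.
    """
    event_time_ms = start_timestamp_ms + batch_id * advance_ms_per_batch
    first_value = batch_id * rows_per_batch
    return [(event_time_ms, first_value + i) for i in range(rows_per_batch)]


batch0 = micro_batch_rows(0, 3, 1000, 500)
batch1 = micro_batch_rows(1, 3, 1000, 500)
```

Here `batch0` holds values 0 to 2 at timestamp 1000 and `batch1` holds values 3 to 5 at timestamp 1500, however late either batch actually runs. That fixed mapping is what lets a test plan in advance which rows fall into which aggregation window.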

Issue Links

links to

[Github] Pull Request #34333 (HeartSaVioR)

People

Assignee: Jungtaek Lim

Reporter: Jungtaek Lim

Votes: 0

Watchers: 2

Dates

Created: 19/Oct/21 20:45

Updated: 16/Mar/22 07:28

Resolved: 01/Nov/21 11:04