Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 4.0.0
Description
In a streaming query, we calculate the number of output rows per stream by collecting the metric from the source nodes in the executed plan.
For DSv2 data sources, the source nodes in the executed plan are always MicroBatchScanExec, and these nodes contain the stream information.
But for DSv1 data sources, the logical node and the physical node representing the scan of the source are technically arbitrary (any logical node and any physical node). Spark therefore assumes that the leaf nodes of the initial logical plan, the logical plan for batch N, and the physical plan for batch N line up one-to-one, so that these nodes can be associated. This assumption is fragile, and we have a non-trivial number of reports of broken metrics.
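To make the fragility concrete, here is a minimal, self-contained sketch of the positional association described above. The names (`LeafNode`, `initialLeaves`, `physicalLeaves`) are illustrative, not Spark's actual API; the point is that the pairing is purely by position.

```scala
// Hypothetical minimal model of the positional leaf-node association
// Spark relies on for DSv1 sources. Names are illustrative only.
case class LeafNode(name: String)

// Leaf nodes of the initial logical plan, in order.
val initialLeaves = Seq(LeafNode("sourceA"), LeafNode("sourceB"))

// Leaf nodes of the physical plan for batch N, assumed to be in the SAME order.
val physicalLeaves = Seq(LeafNode("scanA"), LeafNode("scanB"))

// The association is purely positional: zip the two sequences.
val associated: Seq[(LeafNode, LeafNode)] = initialLeaves.zip(physicalLeaves)

// If an optimizer rule reorders, prunes, or duplicates a leaf in one plan
// but not the other, the zip silently pairs the wrong nodes, which is why
// the reported metric can break.
```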
This ticket aims to address the limitation for DSv1 streaming sources. The idea is to scope the logical/physical nodes to a "widely-used set" and pass the stream information into these nodes, so that DSv1 streaming sources can use the same metric-calculation approach as DSv2.
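The direction of the fix can be sketched as follows: once the scan node carries the stream identity explicitly (as `MicroBatchScanExec` does for DSv2), per-stream metrics can be keyed by that identity instead of by position. This is a hedged, self-contained illustration; `StreamScan`, `SQLMetric`, and `outputRowsPerStream` are hypothetical names, not the actual Spark implementation.

```scala
// Hypothetical sketch: a scan node that knows which stream it reads from,
// so the per-stream output-row metric is computed by stream id, not position.
case class SQLMetric(var value: Long = 0L)

case class StreamScan(streamId: String, numOutputRows: SQLMetric)

// Aggregate the metric per stream by keying on the explicit stream identity.
def outputRowsPerStream(scans: Seq[StreamScan]): Map[String, Long] =
  scans.groupBy(_.streamId)
    .map { case (id, ss) => id -> ss.map(_.numOutputRows.value).sum }

val scans = Seq(
  StreamScan("kafka-0", SQLMetric(10)),
  StreamScan("kafka-0", SQLMetric(5)),
  StreamScan("socket-1", SQLMetric(3)))
```

With explicit stream ids, reordering or pruning scan nodes no longer corrupts the association, because grouping is by id rather than by leaf position.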