Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 4.0.0
Description
In a streaming query, we calculate the number of output rows per stream by collecting the metric from the source nodes in the executed plan.
For DSv2 data sources, the source nodes in the executed plan are always MicroBatchScanExec, and these nodes contain the stream information.
But for DSv1 data sources, the logical node and the physical node representing the scan of the source are technically arbitrary (any logical node and any physical node). Spark therefore assumes that the leaf nodes of the initial logical plan, the logical plan for batch N, and the physical plan for batch N line up one-to-one, so that these nodes can be associated. This assumption is fragile, and we have a non-trivial number of reports of broken metrics.
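To make the fragility concrete, here is a minimal, self-contained sketch of the positional association described above. The names (`LeafNode`, `initialLeaves`, `physicalLeaves`) are illustrative, not Spark's actual API; the point is that the pairing is purely by position.

```scala
// Hypothetical minimal model of the positional leaf-node association
// Spark relies on for DSv1 sources. Names are illustrative only.
case class LeafNode(name: String)

// Leaf nodes of the initial logical plan, in order.
val initialLeaves = Seq(LeafNode("sourceA"), LeafNode("sourceB"))

// Leaf nodes of the physical plan for batch N, assumed to be in the SAME order.
val physicalLeaves = Seq(LeafNode("scanA"), LeafNode("scanB"))

// The association is purely positional: zip the two sequences.
val associated: Seq[(LeafNode, LeafNode)] = initialLeaves.zip(physicalLeaves)

// If an optimizer rule reorders, prunes, or duplicates a leaf in one plan
// but not the other, the zip silently pairs the wrong nodes, which is why
// the reported metric can break.
```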
This ticket aims to address the limitation for DSv1 streaming sources. The idea is to scope the logical/physical nodes to a "widely-used set" and pass the stream information into these nodes, so that DSv1 streaming sources can use the same metric-calculation approach as DSv2.
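The direction of the fix can be sketched as follows: once the scan node carries the stream identity explicitly (as `MicroBatchScanExec` does for DSv2), per-stream metrics can be keyed by that identity instead of by position. This is a hedged, self-contained illustration; `StreamScan`, `SQLMetric`, and `outputRowsPerStream` are hypothetical names, not the actual Spark implementation.

```scala
// Hypothetical sketch: a scan node that knows which stream it reads from,
// so the per-stream output-row metric is computed by stream id, not position.
case class SQLMetric(var value: Long = 0L)

case class StreamScan(streamId: String, numOutputRows: SQLMetric)

// Aggregate the metric per stream by keying on the explicit stream identity.
def outputRowsPerStream(scans: Seq[StreamScan]): Map[String, Long] =
  scans.groupBy(_.streamId)
    .map { case (id, ss) => id -> ss.map(_.numOutputRows.value).sum }

val scans = Seq(
  StreamScan("kafka-0", SQLMetric(10)),
  StreamScan("kafka-0", SQLMetric(5)),
  StreamScan("socket-1", SQLMetric(3)))
```

With explicit stream ids, reordering or pruning scan nodes no longer corrupts the association, because grouping is by id rather than by leaf position.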