[SPARK-24634] Add a new metric regarding number of rows later than watermark - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.1.0
Component/s: Structured Streaming
Labels:
None

Description

Spark filters out late rows which are later than watermark while applying operations which leverage window. While Spark exposes information regarding watermark to StreamingQueryListener, there's no information regarding rows being filtered out due to watermark. The information should help end users to adjust watermark while operating their query.

We could expose metric regarding number of rows later than watermark and being filtered out. It would be ideal to support side-output to consume late rows, but it doesn't look like easy so addressing this first.