[SPARK-26221] Improve Spark SQL instrumentation and metrics - ASF JIRA

This is an umbrella ticket for various small improvements for better metrics and instrumentation. Some thoughts:

Differentiate query plans that write data out from those that return data to the driver

I.e., ETL and report generation vs. interactive analysis.
This is related to the data sink item below. We need to make sure that we can tell from the query plan what a query is doing.

Data sink: Have an operator for data sink, with metrics that can tell us:

Scan

Track file listing time (start and end so we can construct timeline, not just duration)
Track metastore operation time
Track IO decoding time for row-based input sources; we need to make sure the overhead is low
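
The "start and end, not just duration" idea for file listing can be illustrated with a small sketch. It is not Spark code: `PhaseTimeline`, `measure`, and the phase names are hypothetical, and the clock is injectable only so the example can be checked deterministically.

```python
import time
from contextlib import contextmanager


class PhaseTimeline:
    """Hypothetical helper: keeps (start, end) per phase instead of a duration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.spans = {}  # phase name -> (start, end)

    @contextmanager
    def measure(self, phase):
        start = self._clock()
        try:
            yield
        finally:
            # Store both endpoints so phases can be placed on a timeline.
            self.spans[phase] = (start, self._clock())

    def duration(self, phase):
        # A duration can always be derived from the span; the reverse is not true.
        start, end = self.spans[phase]
        return end - start
```

Wrapping a step as `with timeline.measure("file_listing"): ...` records when it began and ended, so overlapping or back-to-back phases can be drawn on one timeline.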

Shuffle

Client fetch time

Sometimes a query takes a long time to run because it is blocked on the client fetching results (e.g. using a result iterator). Record the time blocked on the client so we can subtract it when measuring query execution time.
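
One way to picture the client fetch measurement is an iterator wrapper that accumulates the time between handing a row back and being asked for the next one. This is a minimal sketch, not how Spark does it; `ClientBlockedTimer` and `blocked_on_client` are hypothetical names.

```python
import time


class ClientBlockedTimer:
    """Hypothetical wrapper: sums the time the consumer spends between fetches."""

    def __init__(self, rows, clock=time.monotonic):
        self._rows = iter(rows)
        self._clock = clock
        self._handed_back_at = None  # when the previous row was returned
        self.blocked_on_client = 0.0

    def __iter__(self):
        return self

    def __next__(self):
        now = self._clock()
        if self._handed_back_at is not None:
            # Everything since the previous row was returned happened in client code.
            self.blocked_on_client += now - self._handed_back_at
        try:
            return next(self._rows)
        finally:
            # Runs on a normal return and on StopIteration alike.
            self._handed_back_at = self._clock()
```

Subtracting `blocked_on_client` from the wall-clock time of the whole fetch loop gives an estimate of the time spent producing rows rather than waiting on the consumer.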

Make it easy to correlate queries with jobs, stages, and tasks belonging to a single query, e.g. dump execution id in task logs?

Better logging:

Enable logging the query execution id and TID in executor logs, and query execution id in driver logs.
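
As a language-neutral illustration of tagging log lines with an execution id, the sketch below uses Python's standard `logging` and `contextvars` modules. It covers only the execution id, not the TID; the names `current_execution_id`, `ExecutionIdFilter`, and `make_logger`, and the log format, are assumptions for the example and do not reflect Spark's logging setup.

```python
import contextvars
import logging

# Hypothetical holder for the id of the query currently being executed.
current_execution_id = contextvars.ContextVar("execution_id", default="-")


class ExecutionIdFilter(logging.Filter):
    """Attaches the current execution id to every record passing through."""

    def filter(self, record):
        record.execution_id = current_execution_id.get()
        return True  # never drop records, only annotate them


def make_logger(name, stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[exec=%(execution_id)s] %(message)s"))
    handler.addFilter(ExecutionIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output confined to our handler
    return logger
```

With the id on every line, logs from the driver and from tasks can be grouped by query using a plain text search.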

contains

SPARK-26139 Support passing shuffle metrics to exchange operator

1.	Scan: track file listing time	Resolved	Unassigned
2.	Scan: track metastore operation time	Resolved	Unassigned
3.	Scan: track decoding time for row-based data sources	Resolved	Unassigned
4.	Instrumentation for query planning time	Resolved	Reynold Xin
5.	Update query tracker to report timeline for phases, rather than duration	Resolved	Reynold Xin
6.	Add queryId to IncrementalExecution	Resolved	Reynold Xin
7.	Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set	Resolved	Yuanjian Li
8.	Driver-side only metrics support	Resolved	Unassigned