This is an umbrella ticket for various small improvements to metrics and instrumentation. Some thoughts:
Differentiate query plans that write data out vs. those that return data to the driver
- i.e. ETL and report generation vs. interactive analysis
- This is related to the data sink item below: we need to be able to tell from the query plan what a query is doing
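One way to make this distinction visible is to classify a query by its root operator. The sketch below is purely illustrative (these are not Spark's actual plan classes; `WriteToDataSink` and `CollectToDriver` are assumed names), showing the idea of telling write-out plans apart from plans that return rows to the driver:

```java
// Hypothetical plan node types -- assumed names, not Spark's real classes.
interface PlanNode { }
final class WriteToDataSink implements PlanNode { }   // assumed sink operator
final class CollectToDriver implements PlanNode { }   // e.g. collect() / limit

final class QueryClassifier {
    enum Kind { WRITE, INTERACTIVE }

    // Classify by the root operator: a sink at the root means an
    // ETL-style write; anything else returns rows to the driver.
    static Kind classify(PlanNode root) {
        return (root instanceof WriteToDataSink) ? Kind.WRITE : Kind.INTERACTIVE;
    }
}
```

With a dedicated sink operator in the plan (next item), this check becomes a simple root-node test rather than a heuristic.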
Data sink: add an operator for the data sink, with metrics that can tell us:
- Write time
- Number of records written
- Size of output written
- Number of partitions modified
- Metastore update time
- Also track the number of records returned for collect / limit
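The sink metrics listed above could be accumulated in a simple per-operator holder. This is a minimal sketch with illustrative names, not Spark's actual SQLMetric API:

```java
// Hypothetical per-sink metrics holder -- field names mirror the metrics
// listed in this ticket; this is not Spark's real metrics API.
final class DataSinkMetrics {
    long writeTimeNanos;            // time spent writing
    long recordsWritten;            // number of records written
    long bytesWritten;              // size of output written
    long metastoreUpdateTimeNanos;  // metastore update time
    java.util.Set<String> partitionsModified = new java.util.HashSet<>();

    // Called once per written record (or batch) by the sink operator.
    void recordWrite(String partition, long bytes, long elapsedNanos) {
        recordsWritten += 1;
        bytesWritten += bytes;
        partitionsModified.add(partition);
        writeTimeNanos += elapsedNanos;
    }
}
```

A set of partition names gives the "number of partitions modified" metric for free via its size.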
Scan: track per-scan metrics:
- File listing time (record start and end timestamps so we can construct a timeline, not just a duration)
- Metastore operation time
- IO decoding time for row-based input sources; we need to make sure the overhead is low
- Read time and write time
- Decide whether we can measure serialization and deserialization time
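Capturing start and end timestamps (rather than only a duration) is what makes timeline reconstruction possible. A minimal sketch of that idea, with assumed names, might look like:

```java
// Hypothetical phase-timing record: keeps both endpoints so the driver can
// reconstruct a timeline of phases, not just sum their durations.
final class PhaseTiming {
    final String phase;       // e.g. "fileListing", "metastoreOp"
    final long startMillis;
    final long endMillis;

    PhaseTiming(String phase, long startMillis, long endMillis) {
        this.phase = phase;
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    long durationMillis() { return endMillis - startMillis; }

    // Run `body`, recording its start/end timestamps into `out`.
    static <T> T timed(String phase, java.util.List<PhaseTiming> out,
                       java.util.function.Supplier<T> body) {
        long start = System.currentTimeMillis();
        T result = body.get();
        out.add(new PhaseTiming(phase, start, System.currentTimeMillis()));
        return result;
    }
}
```

Wrapping file listing or a metastore call in `timed(...)` then yields both the duration metric and the data needed to draw overlapping phases on a timeline.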
Client fetch time
- Sometimes a query takes a long time to run because it is blocked on the client fetching results (e.g. through a result iterator). Record the time spent blocked on the client so we can exclude it when measuring query execution time.
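One plausible way to measure this is to wrap the result iterator handed to the client and accumulate the gaps between consecutive fetches, since those gaps are time the query spends blocked on the client. A sketch under that assumption (not an existing Spark class):

```java
import java.util.Iterator;

// Hypothetical wrapper: accumulates time elapsed between one next() returning
// and the following next() being called -- i.e. time blocked on the client.
final class ClientTimingIterator<T> implements Iterator<T> {
    private final Iterator<T> underlying;
    private long lastReturnNanos = -1L;
    private long blockedNanos = 0L;

    ClientTimingIterator(Iterator<T> underlying) { this.underlying = underlying; }

    // Total time spent waiting on the client between fetches.
    long clientBlockedNanos() { return blockedNanos; }

    @Override public boolean hasNext() { return underlying.hasNext(); }

    @Override public T next() {
        long now = System.nanoTime();
        if (lastReturnNanos >= 0) {
            blockedNanos += now - lastReturnNanos;  // gap = client-side time
        }
        T elem = underlying.next();
        lastReturnNanos = System.nanoTime();
        return elem;
    }
}
```

Subtracting `clientBlockedNanos()` from wall-clock time then gives a fairer measure of actual query execution time.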
Make it easy to correlate a query with the jobs, stages, and tasks it spawns, e.g. by dumping the execution id in task logs.
- Enable logging the query execution id and TID in executor logs, and the query execution id in driver logs.
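One common way to get an id into every log line is to carry it in thread-local context (the same idea as an MDC in logging frameworks) and prefix log output with it. A minimal sketch, with assumed names:

```java
// Hypothetical sketch: hold the current query execution id in a ThreadLocal
// so log lines emitted on that thread can be prefixed with it, making driver
// and executor logs easy to correlate per query.
final class ExecutionIdLogging {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Run `body` with the given execution id set for this thread.
    static <T> T withExecutionId(String id, java.util.function.Supplier<T> body) {
        CURRENT.set(id);
        try {
            return body.get();
        } finally {
            CURRENT.remove();  // avoid leaking the id to reused pool threads
        }
    }

    // Prefix the message with the current execution id, if any.
    static String log(String msg) {
        String id = CURRENT.get();
        return (id == null ? "" : "[execId=" + id + "] ") + msg;
    }
}
```

Clearing the thread-local in `finally` matters on executors, where threads are pooled and reused across tasks from different queries.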
||Subtask||Status||Assignee||
|Scan: track file listing time|In Progress|Unassigned|
|Scan: track metastore operation time|In Progress|Unassigned|
|Scan: track decoding time for row-based data sources|In Progress|Unassigned|
|Driver-side only metrics support|In Progress|Unassigned|