Spark / SPARK-26221

Improve Spark SQL instrumentation and metrics



    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version: 2.4.0
    • Fix Version: None
    • Component: SQL


      This is an umbrella ticket for various small improvements for better metrics and instrumentation. Some thoughts:


      Differentiate query plans that write data out vs. return data to the driver

      • I.e. ETL and report generation vs. interactive analysis
      • This is related to the data sink item below. We need to make sure we can tell from the query plan what a query is doing.
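The classification above can be sketched with a toy plan model. These are hypothetical node classes for illustration, not Spark's actual `LogicalPlan` hierarchy:

```java
// Toy sketch (not Spark's actual plan classes): classifying a query plan
// as "writes data out" vs. "returns rows to the driver".
interface PlanNode {}

record Scan(String table) implements PlanNode {}
record Project(PlanNode child) implements PlanNode {}
record InsertIntoTable(String table, PlanNode child) implements PlanNode {}

class PlanClassifier {
    // A query "writes data out" when its root is a data-writing command;
    // otherwise results flow back to the driver (interactive analysis).
    static boolean writesDataOut(PlanNode plan) {
        return plan instanceof InsertIntoTable;
    }
}
```

The point is that the distinction should be decidable from the plan's root alone, so instrumentation can tag a query as ETL vs. interactive before execution starts.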

      Data sink: Have an operator for the data sink, with metrics that can tell us:

      • Write time
      • Number of records written
      • Size of output written
      • Number of partitions modified
      • Metastore update time
      • Also track number of records for collect / limit
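As a sketch of what such a metrics bundle might look like, with illustrative field names (not Spark's actual SQLMetrics keys):

```java
// Hypothetical per-task metrics a data-sink operator could expose.
record DataSinkMetrics(long writeTimeMs, long recordsWritten, long bytesWritten,
                       long partitionsModified, long metastoreUpdateTimeMs) {

    // Driver-side aggregation of per-task metrics into a query-level total.
    // Note: summing partitionsModified over-counts when tasks touch the same
    // partition; a real implementation would union partition identifiers.
    DataSinkMetrics merge(DataSinkMetrics other) {
        return new DataSinkMetrics(
            writeTimeMs + other.writeTimeMs(),
            recordsWritten + other.recordsWritten(),
            bytesWritten + other.bytesWritten(),
            partitionsModified + other.partitionsModified(),
            metastoreUpdateTimeMs + other.metastoreUpdateTimeMs());
    }
}
```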


      • Track file listing time (start and end so we can construct timeline, not just duration)
      • Track metastore operation time
      • Track IO decoding time for row-based input sources; need to make sure the overhead is low
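The first bullet asks for start and end timestamps rather than a bare duration, so events can later be laid out on a timeline. A minimal sketch (names are illustrative):

```java
import java.util.Map;
import java.util.function.Supplier;

// Sketch: record start AND end timestamps for an operation such as a
// file-listing call, so a timeline can be reconstructed afterwards.
record TimedEvent(String name, long startNs, long endNs) {
    long durationNs() { return endNs - startNs; }
}

class Timeline {
    // Runs `body`, returning its result together with the timed event.
    static <T> Map.Entry<T, TimedEvent> timed(String name, Supplier<T> body) {
        long start = System.nanoTime();
        T result = body.get();
        long end = System.nanoTime();
        return Map.entry(result, new TimedEvent(name, start, end));
    }
}
```

Keeping both endpoints costs one extra long per event but lets the UI show overlap between, say, file listing and metastore calls, which a duration-only metric cannot.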


      • Track read time and write time
      • Decide if we can measure serialization and deserialization

      Client fetch time

      • Sometimes a query takes a long time to run because it is blocked on the client fetching results (e.g. through a result iterator). Record the time blocked on the client so we can exclude it when measuring query execution time.
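One way to capture this, sketched with a hypothetical wrapper (not an existing Spark class): wrap the result iterator handed to the client and accumulate the wall-clock time that elapses between fetches.

```java
import java.util.Iterator;
import java.util.List;

// Sketch: time elapsed between one next() returning and the following
// next() being invoked is time spent "blocked on the client", and can be
// subtracted from wall-clock query time.
class ClientFetchTimer<T> implements Iterator<T> {
    private final Iterator<T> underlying;
    private long lastReturnNs = -1L;
    private long blockedOnClientNs = 0L;

    ClientFetchTimer(Iterator<T> underlying) { this.underlying = underlying; }

    @Override public boolean hasNext() { return underlying.hasNext(); }

    @Override public T next() {
        long now = System.nanoTime();
        if (lastReturnNs >= 0) blockedOnClientNs += now - lastReturnNs;
        T elem = underlying.next();
        lastReturnNs = System.nanoTime();
        return elem;
    }

    long timeBlockedOnClientNs() { return blockedOnClientNs; }
}
```

This measures only inter-fetch gaps, so a query whose client drains the iterator immediately reports near-zero blocked time.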

      Make it easy to correlate the jobs, stages, and tasks belonging to a single query, e.g. by dumping the execution id in task logs?

      Better logging:

      • Enable logging the query execution id and TID in executor logs, and query execution id in driver logs.
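A sketch of what such a prefix could look like; the format and names here are illustrative, not an existing Spark convention:

```java
import java.util.OptionalLong;

// Sketch: a shared log prefix carrying the SQL execution id plus, on
// executors, the task id (TID), so all lines for one query can be
// grepped together across driver and executor logs.
class QueryLogPrefix {
    static String format(long executionId, OptionalLong tid) {
        return tid.isPresent()
            ? "[execId=" + executionId + " TID=" + tid.getAsLong() + "]"
            : "[execId=" + executionId + "]";  // driver-side lines have no TID
    }
}
```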


              Assignee: Reynold Xin (rxin)
              Reporter: Reynold Xin (rxin)
              Votes: 1
              Watchers: 8