Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26221

Improve Spark SQL instrumentation and metrics

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Umbrella
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.4.0
    • None
    • SQL

    Description

      This is an umbrella ticket for various small improvements for better metrics and instrumentation. Some thoughts:

       

      Differentiate query plan that’s writing data out, vs returning data to the driver

      • I.e. ETL & report generation vs interactive analysis
      • This is related to the data sink item below. We need to make sure from the query plan we can tell what a query is doing

      Data sink: Have an operator for data sink, with metrics that can tell us:

      • Write time
      • Number of records written
      • Size of output written
      • Number of partitions modified
      • Metastore update time
      • Also track number of records for collect / limit

      Scan

      • Track file listing time (start and end so we can construct timeline, not just duration)
      • Track metastore operation time
      • Track IO decoding time for row-based input sources; Need to make sure overhead is low

      Shuffle

      • Track read time and write time
      • Decide if we can measure serialization and deserialization

      Client fetch time

      • Sometimes a query take long to run because it is blocked on the client fetching result (e.g. using a result iterator). Record the time blocked on client so we can remove it in measuring query execution time.

      Make it easy to correlate queries with jobs, stages, and tasks belonging to a single query, e.g. dump execution id in task logs?

      Better logging:

      • Enable logging the query execution id and TID in executor logs, and query execution id in driver logs.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            rxin Reynold Xin
            rxin Reynold Xin
            Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment