[SPARK-50303] Enable QUERY_TAG for SQL Session in Spark SQL

Details

    • Type: Wish
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0, 3.5.3
    • Fix Version/s: None
    • Component/s: SQL

    Description

      As Spark SQL becomes more powerful for both analytics and ELT (with a big T), more tools are generating and executing SQL to transform data.

      Sessions are an important mechanism for lineage and usage/cost tracking, especially for multi-statement ELT jobs. Tagging a series of query statements with higher-level business context (such as project, flow_name, job_name, batch_id, start_data_dt, end_data_dt, owner, cost_group, ...) can greatly improve observability with little overhead. Collecting scattered query UUIDs and grouping them after the fact to reconstruct the session is inefficient, whereas letting the SQL client set the tags when the session is established is easy.

      • Presto has Session Properties
      • Trino has X-Trino-Session, X-Trino-Client-Info and X-Trino-Client-Tags to carry a list of K/V
      • Snowflake has QUERY_TAG to make observability much easier and more efficient
      • Redshift supports query tagging as well
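
      To make the request concrete, here is a minimal sketch of how a client might pack the business context into one session-level tag string, and how an analysis job might unpack it again. The function names and the JSON encoding are assumptions made for illustration; Spark does not define a QUERY_TAG format today.

```python
import json


def build_query_tag(context: dict) -> str:
    """Serialize business context into a compact, deterministic JSON string.

    Sorting the keys means the same context always yields the same tag,
    which keeps grouping by tag straightforward.
    """
    return json.dumps(context, sort_keys=True, separators=(",", ":"))


def parse_query_tag(tag: str) -> dict:
    """Recover the business context from a tag produced by build_query_tag."""
    return json.loads(tag)


# Example context, using a few of the field names listed in the description.
context = {
    "project": "sales_mart",
    "flow_name": "daily_load",
    "batch_id": 42,
    "owner": "data-eng",
}

tag = build_query_tag(context)
print(tag)
```

      With a live SparkSession, a client could stash such a string in a runtime setting through spark.conf.set under a key of its own choosing, but no key is currently recognized by Spark as a query tag, so any consumer of that value would have to be custom.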

      It would be great if Spark SQL set a paved path/recipe for workload, cost, and observability analysis based on a session-level QUERY_TAG, so that the whole community can follow it instead of reinventing the wheel.

          People

            Unassigned
            Eric Sun (ericsun2)
            Shant Hovsepian