Details
Type: Wish
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 4.0.0, 3.5.3
Fix Version/s: None
Description
As Spark SQL becomes more powerful for both analytics and ELT (with a big T), we see more and more tools generating and executing SQL to transform data.
A session is a very important mechanism for lineage and usage/cost tracking, especially for multi-statement ELT cases. Tagging a series of query statements with higher-level business context (such as project, flow_name, job_name, batch_id, start_data_dt, end_data_dt, owner, cost_group, ...) can provide a tremendous observability improvement with very little overhead. Collecting the scattered query UUIDs and trying to group them back into a session is inefficient; it is far easier to let the SQL client set the tags once, when the session is established (a workaround sketch using today's APIs follows this paragraph).
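As a rough illustration only: the sketch below approximates session tags with existing public APIs. The spark.myapp.tag.* keys are invented for this example (Spark's runtime conf accepts arbitrary entries), and the QueryExecutionListener just prints what a lineage collector might record; this is a workaround sketch, not the proposed QUERY_TAG feature.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

object SessionTagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("session-tag-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical tag keys: the client attaches business context once,
    // right after the session is established.
    spark.conf.set("spark.myapp.tag.project", "revenue_reporting")
    spark.conf.set("spark.myapp.tag.flow_name", "daily_rollup")
    spark.conf.set("spark.myapp.tag.batch_id", "20240601")

    // Every statement in the session can then be joined back to those tags,
    // e.g. by a listener that a lineage/cost collector registers.
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        val project = qe.sparkSession.conf.get("spark.myapp.tag.project", "unset")
        val flow = qe.sparkSession.conf.get("spark.myapp.tag.flow_name", "unset")
        println(s"[lineage] project=$project flow=$flow durationNs=$durationNs")
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    })

    spark.sql("SELECT 1 AS id").collect()
    Thread.sleep(1000) // the listener bus is async; give it time to deliver
    spark.stop()
  }
}
```

The obvious gap is that nothing standardizes the key names across tools, which is exactly what a paved path would fix.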
- Presto has Session Properties.
- Trino has X-Trino-Session, X-Trino-Client-Info, and X-Trino-Client-Tags headers to carry a list of key/value pairs (see the JDBC sketch after this list).
- Snowflake has QUERY_TAG, which makes observability much easier and more efficient.
- Redshift supports query tagging as well.
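For comparison, here is a minimal sketch of how a Trino JDBC client attaches tags when the session is established. clientTags and clientInfo are the documented Trino JDBC connection properties that map onto the X-Trino-Client-Tags and X-Trino-Client-Info headers; the endpoint, catalog, and tag values here are placeholders.

```scala
import java.sql.DriverManager
import java.util.Properties

object TrinoClientTagsSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("user", "etl_user")
    // Comma-separated tags are sent as the X-Trino-Client-Tags header and
    // surface in system.runtime.queries for usage/cost attribution.
    props.setProperty("clientTags", "project=revenue_reporting,batch_id=20240601")
    // Free-form client info, sent as X-Trino-Client-Info.
    props.setProperty("clientInfo", "daily_rollup_flow")

    val conn = DriverManager.getConnection(
      "jdbc:trino://trino.example.com:8080/hive/default", props)
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT 1")
    while (rs.next()) println(rs.getInt(1))
    rs.close(); stmt.close(); conn.close()
  }
}
```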
It would be great if Spark SQL set a paved path/recipe for workload/cost analysis and observability based on a session-level QUERY_TAG, so that the whole community can follow it instead of reinventing the wheel.