Description
Motivation
Spark SQL supports running the `EXPLAIN COST` statement on a query to show the optimized logical plan and its data costs (i.e. statistics) per stage (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html). However, it is currently difficult to determine the total data read cost of a complex query with many stages. Other query engines such as Trino/Presto provide a general estimate of the resource costs of a query when the `EXPLAIN` statement is run, including CPU, memory, row count, and data size (https://trino.io/docs/current/optimizer/cost-in-explain.html).
Proposal
We suggest adding an overview/estimation of the total resources that will be used to the optimized logical plan of a Spark query, or, alternatively, providing this overview/estimation when the `EXPLAIN COST` statement is run on a query. As a first version, it would already be beneficial if this cost estimate included anything that is available within the statistics of the optimized query plan, such as:
- The amount of data that will be read, in bytes
- The total number of rows
- etc.
Given that the optimized logical plan is divided into stages, it would already be sufficient to show these parameters per stage, so they can be aggregated for the entire job later on if needed.
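To illustrate the kind of aggregation this proposal describes, here is a minimal sketch (not part of any existing Spark API) that parses the `Statistics(sizeInBytes=..., rowCount=...)` annotations emitted by `EXPLAIN COST` and sums them into a total. The plan text below is a hypothetical example for demonstration; the `summarize` helper is an assumption of this sketch, not an existing function.

```python
import re

# Hypothetical excerpt of `EXPLAIN COST` output; real plans carry the same
# `Statistics(sizeInBytes=..., rowCount=...)` annotations per plan node.
plan = """
== Optimized Logical Plan ==
Join Inner, (id#0L = id#3L), Statistics(sizeInBytes=2.0 KiB, rowCount=64)
:- Relation[id#0L] parquet, Statistics(sizeInBytes=1.0 KiB, rowCount=128)
+- Relation[id#3L] parquet, Statistics(sizeInBytes=512.0 B, rowCount=32)
"""

# Binary size units as printed by Spark's statistics output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
         "TiB": 1024**4, "PiB": 1024**5, "EiB": 1024**6}

def summarize(plan_text: str) -> dict:
    """Aggregate per-node statistics into a rough total-cost overview."""
    total_bytes = 0.0
    for size, unit in re.findall(r"sizeInBytes=([\d.]+) (\w+)", plan_text):
        total_bytes += float(size) * UNITS[unit]
    total_rows = sum(int(r) for r in re.findall(r"rowCount=(\d+)", plan_text))
    return {"total_size_bytes": total_bytes, "total_rows": total_rows}

print(summarize(plan))  # → {'total_size_bytes': 3584.0, 'total_rows': 224}
```

A built-in version of such a summary would of course work on the plan's statistics objects directly rather than on the rendered text, but the output shape (total bytes, total rows, per stage or per job) is what this proposal is asking for.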