Description
Motivation
Spark SQL supports running the `EXPLAIN COST` statement on a query to show the optimized logical plan and its data costs (i.e. statistics) per stage (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html). However, it is currently difficult to determine the total data read cost of a complex query with many stages. Other query engines such as Trino/Presto provide a general estimate of the resource costs of a query when the `EXPLAIN` statement is run, including CPU, memory, row count, and data size (https://trino.io/docs/current/optimizer/cost-in-explain.html).
Proposal
We suggest adding an overview/estimation of the total resources that will be used to the optimized logical plan of a Spark query, or, alternatively, providing this overview/estimation when the `EXPLAIN COST` statement is run on a query. As a first version, it would already be beneficial if this cost estimate included anything that is available within the statistics of the optimized query plan, such as:
- The amount of data that will be read, in bytes
- The total number of rows
- etc.
Given that the optimized logical plan is divided into stages, it would already be sufficient to show these parameters per stage, so they can be aggregated for the entire job later on if needed.
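To illustrate the kind of aggregation this proposal describes, here is a minimal sketch (not part of any existing Spark API) that parses the `Statistics(sizeInBytes=..., rowCount=...)` annotations emitted by `EXPLAIN COST` and sums them into a total. The plan text below is a hypothetical example for demonstration; the `summarize` helper is an assumption of this sketch, not an existing function.

```python
import re

# Hypothetical excerpt of `EXPLAIN COST` output; real plans carry the same
# `Statistics(sizeInBytes=..., rowCount=...)` annotations per plan node.
plan = """
== Optimized Logical Plan ==
Join Inner, (id#0L = id#3L), Statistics(sizeInBytes=2.0 KiB, rowCount=64)
:- Relation[id#0L] parquet, Statistics(sizeInBytes=1.0 KiB, rowCount=128)
+- Relation[id#3L] parquet, Statistics(sizeInBytes=512.0 B, rowCount=32)
"""

# Binary size units as printed by Spark's statistics output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
         "TiB": 1024**4, "PiB": 1024**5, "EiB": 1024**6}

def summarize(plan_text: str) -> dict:
    """Aggregate per-node statistics into a rough total-cost overview."""
    total_bytes = 0.0
    for size, unit in re.findall(r"sizeInBytes=([\d.]+) (\w+)", plan_text):
        total_bytes += float(size) * UNITS[unit]
    total_rows = sum(int(r) for r in re.findall(r"rowCount=(\d+)", plan_text))
    return {"total_size_bytes": total_bytes, "total_rows": total_rows}

print(summarize(plan))  # → {'total_size_bytes': 3584.0, 'total_rows': 224}
```

A built-in version of such a summary would of course work on the plan's statistics objects directly rather than on the rendered text, but the output shape (total bytes, total rows, per stage or per job) is what this proposal is asking for.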