[SPARK-40822] Use stable derived-column-alias algorithm, suitable for CREATE VIEW - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.5.0
Component/s: Spark Core
Labels:
None

Flags:

Important

Description

Spark has the ability to derive column aliases for expressions if no alias was provided by the user.
For example:
CREATE TABLE T(c1 INT, c2 INT);
SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T);

This is a valuable feature. However, the current implementation works by pretty printing the expression from the logical plan. This has multiple downsides:

- The derived names can be unintuitive, for example the brackets in `(c1 + 1)`, or outright ugly, such as:
  SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T;
- We cannot guarantee stability across versions, since the logical plan of an expression may change.

The latter is a major reason why we cannot allow CREATE VIEW without a column list except in "trivial" cases.

CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T;
Not allowed to create a permanent view `spark_catalog`.`default`.`v` without explicitly assigning an alias for expression (c1 + 1).

There are two ways we can go about fixing this:

1. Stop deriving column aliases from the expression. Instead, generate unique names such as `_col_1` based on each expression's position in the select list. This is ugly and takes away the "nice" headers on result sets.
2. Move the derivation of the name upstream. That is, instead of pretty-printing the logical plan, we pretty-print the lexer output, or a sanitized version of the expression as typed.
   The statement as typed is stable by definition. The lexer is stable because it has no reason to change, and if it ever did, we would have a better chance to manage the change.
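
To make the first option concrete, here is a minimal Python sketch of position-based naming. The function name `positional_aliases` and the input shape are hypothetical and not part of Spark. For simplicity it names every unaliased select-list item by position, which a real implementation might not do for plain column references.

```python
def positional_aliases(select_items):
    """Return one output name per select-list item.

    select_items: list of (expression_text, explicit_alias_or_None) pairs.
    Hypothetical helper for illustration, not a Spark API.
    """
    names = []
    for position, (_expr, alias) in enumerate(select_items, start=1):
        # Keep a user-supplied alias; otherwise fall back to the position.
        names.append(alias if alias is not None else f"_col_{position}")
    return names


# SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T
print(positional_aliases([("c1", None), ("c1 + 1", None), ("c1 * c2", "c3")]))
# prints ['_col_1', '_col_2', 'c3']
```

Such names are stable across versions, but they say nothing about the expression that produced the column.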

In this feature we propose the following semantics:

1. If the column alias can be trivially derived (some of these rules can stack), do so:
   - A (qualified) column reference => the unqualified column identifier
     cat.sch.tab.col => col
   - A field reference => the field name
     struct.field1.field2 => field2
   - A cast(column AS type) => column
     cast(col1 AS INT) => col1
   - A map lookup with literal key => key name
     map.key => key
     map['key'] => key
   - A parameterless function => unqualified function name
     current_schema() => current_schema
2. Otherwise, take the lexer tokens of the expression, eliminate comments, and concatenate them:
   foo(tab1.c1 + /* this is a plus*/
   1) => `foo(tab1.c1+1)`
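
The proposed semantics can be sketched in Python as follows. This is only an illustration under stated assumptions: the regex tokenizer is a simplified stand-in for Spark's lexer, all function names (`derive_alias`, `_trivial`, and so on) are hypothetical, map lookups are recognized only on a dotted reference with a string-literal key, and tokens are joined without any separator.

```python
import re

# Hypothetical, simplified SQL tokenizer (NOT Spark's lexer): comments,
# string literals, backquoted identifiers, numbers, identifiers, and
# single-character punctuation. Whitespace between tokens is skipped.
_TOKEN = re.compile(
    r"""
    (?P<comment>/\*.*?\*/|--[^\n]*)
  | (?P<string>'(?:[^']|'')*')
  | (?P<quoted>`(?:[^`]|``)*`)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<punct>\S)
    """,
    re.VERBOSE | re.DOTALL,
)


def _tokens(expr):
    """Lex the expression and drop comment tokens."""
    return [m.group() for m in _TOKEN.finditer(expr) if m.lastgroup != "comment"]


def _is_ident(tok):
    return re.fullmatch(r"[A-Za-z_]\w*|`(?:[^`]|``)*`", tok) is not None


def _name(tok):
    """Strip backquotes from a quoted identifier."""
    return tok[1:-1].replace("``", "`") if tok.startswith("`") else tok


def _is_dotted_ref(toks):
    """True for ident(.ident)*, e.g. cat.sch.tab.col or struct.f1.f2."""
    return len(toks) % 2 == 1 and all(
        _is_ident(t) if i % 2 == 0 else t == "." for i, t in enumerate(toks)
    )


def _trivial(toks):
    """Return the trivially derived alias, or None if no rule applies."""
    # Column, field, or map.key reference: keep the last identifier.
    if _is_dotted_ref(toks):
        return _name(toks[-1])
    # map['key'] with a string-literal key: keep the key.
    if (len(toks) >= 4 and toks[-3] == "[" and toks[-1] == "]"
            and toks[-2].startswith("'") and _is_dotted_ref(toks[:-3])):
        return toks[-2][1:-1].replace("''", "'")
    # Parameterless function: keep the unqualified function name.
    if toks[-2:] == ["(", ")"] and _is_dotted_ref(toks[:-2]):
        return _name(toks[-3])
    # cast(<operand> AS <type>): stack onto the operand's own rule.
    if len(toks) >= 6 and toks[0].lower() == "cast" and toks[1] == "(":
        depth, as_pos = 0, None
        for i, t in enumerate(toks[1:], start=1):
            depth += (t == "(") - (t == ")")
            if depth == 0:
                # The paren opened by cast( must close at the very end.
                if i != len(toks) - 1:
                    return None
                break
            if depth == 1 and t.lower() == "as":
                as_pos = i  # last AS directly inside cast(...)
        if as_pos is not None and depth == 0:
            return _trivial(toks[2:as_pos])
    return None


def derive_alias(expr):
    toks = _tokens(expr)
    alias = _trivial(toks)
    # Fallback: concatenate the comment-free tokens as typed.
    return alias if alias is not None else "".join(toks)


for sql in ["cat.sch.tab.col", "cast(col1 AS INT)", "map['key']",
            "current_schema()", "foo(tab1.c1 + /* this is a plus*/\n1)"]:
    print(derive_alias(sql))
# prints col, col1, key, current_schema, foo(tab1.c1+1), one per line
```

Because the result depends only on the text as typed, it does not change when the logical plan does. Joining tokens with no separator also glues keywords to their neighbours (in this sketch `cast(a + 1 AS INT)` yields `cast(a+1ASINT)`), so the spacing rule is a design choice any real implementation would have to settle.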

Of course, we want this change behind a config.
If the config is set, we can allow CREATE VIEW to exploit this and use the derived column aliases.

PS: The exact mechanics of formatting the name are very much debatable, e.g. spaces between tokens, squeezing out comments, upper-casing, preserving quotes or double quotes.

Issue Links

causes

SPARK-42873 Define Spark SQL types as keywords

Resolved

links to

[Github] Pull Request #39332 (MaxGekk)

[Github] Pull Request #40126 (MaxGekk)

People

Assignee: Max Gekk

Reporter: Serge Rielau

Shepherd: Wenchen Fan

Votes: 0

Watchers: 4

Dates

Created: 17/Oct/22 18:08

Updated: 21/Mar/23 06:14

Resolved: 21/Mar/23 06:14