Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-40822

Use stable derived-column-alias algorithm, suitable for CREATE VIEW

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.5.0
    • Spark Core
    • None
    • Important

    Description

      Spark has the ability derive column aliases for expressions if no alias was provided by the user.
      E.g.
      CREATE TABLE T(c1 INT, c2 INT);
      SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T);

      This is a valuable feature. However, the current implementation works by pretty printing the expression from the logical plan.  This has multiple downsides:

      • The derived names can be unintuitive. For example the brackets in `(c1 + 1)` or outright ugly, such as:
        SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T;
      • We cannot guarantee stability across versions since the logical lan of an expression may change.

      The later is a major reason why we cannot allow CREATE VIEW without a column list except in "trivial" cases.

      CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T;
      Not allowed to create a permanent view `spark_catalog`.`default`.`v` without explicitly assigning an alias for expression (c1 + 1).

      There are two way we can go about fixing this:

      1. Stop deriving column aliases from the expression. Instead generate unique names such as `_col_1` based on their position in the select list. This is ugly and takes away the "nice" headers on result sets
      2. Move the derivation of the name upstream. That is instead of pretty printing the logical plan we pretty print the lexer output, or a sanitized version of the expression as typed.
        The statement as typed is stable by definition. The lexer is stable because i has no reason to change. And if it ever did we have a better chance to manage the change.

      In this feature we propose the following semantic:

      1. If the column alias can be trivially derived (some of these can stack), do so:
        • a (qualified) column reference => the unqualified column identifier
          cat.sch.tab.col => col
        • A field reference => the fieldname
          struct.field1.field2 => field2
        • A cast(column AS type) => column
          cast(col1 AS INT) => col1
        • A map lookup with literal key => keyname
          map.key => key
          map['key'] => key
        • A parameter less function => unqualified function name
          current_schema() => current_schema
      2. Take the lexer tokens of the expression, eliminate comments, and append them.
        foo(tab1.c1 + /* this is a plus*/
        1) => `foo(tab1.c1+1)`

       

      Of course we wan this change under a config.
      If the config is set we can allow CREATE VIEW to exploit this and use the derived expressions.

      PS: The exact mechanics of formatting the name is very much debatable. E.g.spaces between token, squeezing out comments - upper casing - preserving quotes or double quotes...)

      Attachments

        Activity

          People

            maxgekk Max Gekk
            srielau Serge Rielau
            Wenchen Fan Wenchen Fan
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: