Spark / SPARK-37865

Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns

Details

    Description

      When the first child of a Union has duplicate columns, as in select a, a from t1 union select a, b from t2, Spark uses only the first column when aggregating the results, which makes the results incorrect. This behavior is inconsistent with other engines such as PostgreSQL and MySQL. We could alias the attributes of the first child of the Union to resolve this, or one could argue that this is a feature of Spark SQL.

      Sample query:
      SELECT a, a
      FROM VALUES (1, 1), (1, 2) AS t1(a, b)
      UNION
      SELECT a, b
      FROM VALUES (1, 1), (1, 2) AS t2(a, b)

      Result from Spark:
      (1,1)

      Result from PostgreSQL and MySQL:
      (1,1)
      (1,2)
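
To make the difference concrete, here is a minimal Python sketch of the two behaviors on the sample data. It is only a model, not Spark's actual code: the helper names (union_distinct, union_distinct_by_name) are hypothetical, and the assumption that duplicate column names resolve to the first matching position is a simplification of the dedup described above.

```python
# Rows produced by each side of the UNION in the sample query.
first_child = [(1, 1), (1, 1)]   # SELECT a, a over t1 rows (1, 1), (1, 2)
second_child = [(1, 1), (1, 2)]  # SELECT a, b over t2 rows (1, 1), (1, 2)
rows = first_child + second_child


def union_distinct(rows):
    """Expected semantics: deduplicate on every column position."""
    return sorted(set(rows))


def union_distinct_by_name(rows, names):
    """Hypothetical model of the reported behavior (an assumption, not
    Spark's implementation): each output column is looked up by name,
    so a duplicated name always resolves to its first position."""
    first_pos = [names.index(n) for n in names]
    seen = set()
    for row in rows:
        seen.add(tuple(row[i] for i in first_pos))
    return sorted(seen)


# Output names come from the first child of the union.
expected = union_distinct(rows)
observed = union_distinct_by_name(rows, ["a", "a"])

# Giving the second column a distinct (illustrative) alias removes the
# ambiguity in this model, mirroring the aliasing fix proposed above.
aliased = union_distinct_by_name(rows, ["a", "a_1"])

print(expected)
print(observed)
print(aliased)
```

In this model the duplicated name collapses both output columns onto the first input column, so the row (1, 2) from the second child is lost. With distinct names, the result matches the position-based deduplication that PostgreSQL and MySQL return.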

          People

            karenfeng Karen Feng
            chasingegg Chao Gao
