[SPARK-37865] Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.0
Fix Version/s: 3.1.3, 3.0.4, 3.3.0, 3.2.2
Component/s: SQL
Labels:
- correctness

Description

When the first child of a Union has duplicate columns, as in select a, a from t1 union select a, b from t2, Spark uses only the first column to aggregate the results, which makes the results incorrect. This behavior is inconsistent with other engines such as PostgreSQL and MySQL. We could alias the attributes of the first child of the Union to resolve this, or one could argue that this is a feature of Spark SQL.

Sample query:
SELECT
  a,
  a
FROM VALUES (1, 1), (1, 2) AS t1(a, b)
UNION
SELECT
  a,
  b
FROM VALUES (1, 1), (1, 2) AS t2(a, b)

Result from Spark:
(1,1)

Result from PostgreSQL and MySQL:
(1,1)
(1,2)
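
The difference between the two results can be reproduced with a small model in plain Python. This is only a sketch of the behavior described above, not Spark's implementation: the function names are hypothetical, and "grouping on the first column only" is an assumption that simply mirrors the description's statement that Spark uses only the first column to aggregate.

```python
def union_distinct(rows):
    """Deduplicate whole rows, comparing every column by position."""
    seen = []
    for row in rows:
        if row not in seen:
            seen.append(row)
    return seen


def union_grouped_on_first_column(rows):
    """Hypothetical model of the reported behavior: only column 0 is the grouping key."""
    groups = {}
    for row in rows:
        # Keep the first row seen for each key; later rows with the same key are dropped.
        groups.setdefault(row[0], row)
    return list(groups.values())


t = [(1, 1), (1, 2)]                      # VALUES (1, 1), (1, 2) AS t(a, b)
first_child = [(a, a) for a, b in t]      # SELECT a, a
second_child = [(a, b) for a, b in t]     # SELECT a, b
combined = first_child + second_child     # rows before deduplication

print(union_distinct(combined))                 # [(1, 1), (1, 2)]
print(union_grouped_on_first_column(combined))  # [(1, 1)]
```

Comparing every column by position yields both rows, matching PostgreSQL and MySQL, while grouping on the first column alone collapses them into the single row that Spark returns.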