[SPARK-33726] Duplicate field names causes wrong answers during aggregation - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.4, 3.0.1
Fix Version/s: 2.4.8, 3.0.2, 3.1.1
Component/s: SQL
Labels:
- correctness

Description

We saw this bug at Workday.

When different fields share the same name, org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate can return a fixed-length batch when it should have returned a variable-length batch, which leads to wrong results.

This example produces wrong results in the spark shell:

scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show

x	x	ma	mb
-2	2	0	null
-1	1	null	1
0	0	0	0

instead of the correct output:

x	x	ma	mb
0	0	0	0
-2	2	2	2
-1	1	1	1

The issue can be solved by iterating over the fields themselves instead of field names.
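
The difference between the two iteration strategies can be sketched in plain Java with no Spark dependency. This is an illustration under stated assumptions, not the code from the linked pull requests: the Field class and both method names are hypothetical stand-ins, and the sketch assumes that a lookup by name resolves duplicate names to a single field (here, the last one registered). The grouping keys in the example above are U.x (a string, variable length) and T.x (a number, fixed length), both named x.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateFieldNames {
    // Hypothetical stand-in for a schema field: a name and whether its type is fixed length.
    static final class Field {
        final String name;
        final boolean fixedLength;

        Field(String name, boolean fixedLength) {
            this.name = name;
            this.fixedLength = fixedLength;
        }
    }

    // Name-based check: every field is resolved again through a name lookup.
    // Assumption: duplicate names collapse to one map entry (the last one wins),
    // so a variable-length field can be hidden by a fixed-length field of the same name.
    static boolean allFixedLengthByName(List<Field> schema) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : schema) {
            byName.put(f.name, f);
        }
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && byName.get(f.name).fixedLength;
        }
        return allFixed;
    }

    // Field-based check: inspects each field directly, so duplicate names do not matter.
    static boolean allFixedLengthByField(List<Field> schema) {
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && f.fixedLength;
        }
        return allFixed;
    }
}
```

With a key schema of two fields named x, the first variable length and the second fixed length, allFixedLengthByName reports that every key is fixed length, while allFixedLengthByField correctly reports that one is not.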

Issue Links

links to

[Github] Pull Request #30788 (yliou)

[Github] Pull Request #31327 (yliou)

[Github] Pull Request #31447 (yliou)

People

Assignee: Yian Liou

Reporter: Yian Liou

Votes: 0

Watchers: 4

Dates

Created: 09/Dec/20 20:49

Updated: 03/Feb/21 02:36

Resolved: 25/Jan/21 06:55