Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
2.4.4, 3.0.1
Description
We saw this bug at Workday.
When different fields share the same name, org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate can return a fixed batch when it should have returned a variable batch, which leads to wrong results.
This example produces wrong results in the spark shell:
scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
+---+---+----+----+
|  x|  x|  ma|  mb|
+---+---+----+----+
| -2|  2|   0|null|
| -1|  1|null|   1|
|  0|  0|   0|   0|
+---+---+----+----+
instead of the correct output:
+---+---+---+---+
|  x|  x| ma| mb|
+---+---+---+---+
|  0|  0|  0|  0|
| -2|  2|  2|  2|
| -1|  1|  1|  1|
+---+---+---+---+
The issue can be solved by iterating over the fields themselves instead of over the field names, because a lookup by name cannot distinguish two fields that share a name.
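To illustrate why the by-name check goes wrong, here is a minimal sketch in plain Scala. It does not use Spark: Field, DataType, DuplicateNameSketch, allFixedByName and allFixedByField are hypothetical stand-ins, not the real StructType or RowBasedKeyValueBatch code. It also assumes that integers count as fixed-length, that strings count as variable-length, and that a lookup by name resolves to a single field per name.

```scala
object DuplicateNameSketch {
  // Hypothetical, simplified stand-ins for Spark's schema types.
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  final case class Field(name: String, dataType: DataType)

  // Assumption: integers are fixed-length, strings are variable-length.
  def isFixedLength(dt: DataType): Boolean = dt match {
    case IntType    => true
    case StringType => false
  }

  // Problematic pattern: walk the field names and resolve each one by name.
  // A name-keyed map holds one entry per name (toMap keeps the last one),
  // so with duplicates one field is checked twice and the other never.
  def allFixedByName(schema: Seq[Field]): Boolean = {
    val byName: Map[String, Field] = schema.map(f => f.name -> f).toMap
    schema.map(_.name).forall(n => isFixedLength(byName(n).dataType))
  }

  // Suggested pattern: walk the fields themselves, so every field is checked.
  def allFixedByField(schema: Seq[Field]): Boolean =
    schema.forall(f => isFixedLength(f.dataType))

  // Grouping key from the example query: U.x (string) and T.x (integer),
  // both named "x".
  val keySchema: Seq[Field] = Seq(Field("x", StringType), Field("x", IntType))

  val fixedByName: Boolean = allFixedByName(keySchema)   // true: string field is never inspected
  val fixedByField: Boolean = allFixedByField(keySchema) // false: string field is seen
}
```

In this model the by-name check reports that every key field is fixed-length, so a fixed batch would be chosen even though one key column is a string. The by-field check sees the string column and reports false.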