Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 3.0.0, 3.0.1, 3.1.0
Description
The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a SortExec node, and (2) it contains a duplicate grouping key, causing RemoveRepetitionFromGroupExpressions to produce a sort order stored as a Stream.
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
When the sort order is stored as a Stream, the line ordering.map(_.child.genCode(ctx)) in GenerateOrdering#createOrderKeys() has unpredictable side effects on ctx, because genCode(ctx) mutates ctx. When ordering is a Stream, map is evaluated lazily, so the modifications do not happen immediately as intended. They occur only later, when the returned Stream is consumed.
Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.
The fix is to check if ordering is a Stream and force the modifications to happen immediately if so.
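The laziness problem and the kind of fix described above can be illustrated without Spark. The following is a minimal sketch, not the actual patch: FakeCtx, genAll, and genAllEagerly are hypothetical names invented for this example, and ctx.register merely stands in for a side-effecting call such as genCode(ctx). Note that Stream is deprecated in Scala 2.13 in favor of LazyList, but it is still available.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a codegen context: it only records side effects.
final class FakeCtx {
  val registered: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def register(name: String): String = { registered += name; s"code_$name" }
}

object StreamLazinessDemo {
  // Same shape as ordering.map(_.child.genCode(ctx)): a map with a side effect.
  // If `ordering` is a Stream, the tail of the result is not evaluated here.
  def genAll(ordering: Seq[String], ctx: FakeCtx): Seq[String] =
    ordering.map(ctx.register)

  // Sketch of the described fix: if the mapped result is a Stream,
  // force it so that every side effect on ctx happens before returning.
  def genAllEagerly(ordering: Seq[String], ctx: FakeCtx): Seq[String] =
    ordering.map(ctx.register) match {
      case s: Stream[String @unchecked] => s.force
      case other                        => other
    }

  def main(args: Array[String]): Unit = {
    val lazyCtx = new FakeCtx
    val out = genAll(Stream("a", "b", "c"), lazyCtx)
    println(s"registered right after map: ${lazyCtx.registered.size}")
    out.foreach(_ => ()) // consuming the Stream triggers the remaining side effects
    println(s"registered after consuming: ${lazyCtx.registered.size}")

    val eagerCtx = new FakeCtx
    genAllEagerly(Stream("a", "b", "c"), eagerCtx)
    println(s"registered with forcing:    ${eagerCtx.registered.size}")
  }
}
```

Forcing the result of map, rather than the input, is the important detail in this sketch: mapping over an already evaluated Stream still yields a Stream with a lazy tail.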
Issue Links
- is related to
  - SPARK-24500 UnsupportedOperationException when trying to execute Union plan with Stream of children (Resolved)
  - SPARK-25767 Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library (Resolved)
  - SPARK-26680 StackOverflowError if Stream passed to groupBy (Resolved)