Spark / SPARK-33260

SortExec produces incorrect results if sortOrder is a Stream

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels:

      Description

      The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a SortExec node, and (2) it contains a duplicate grouping key, causing RemoveRepetitionFromGroupExpressions to produce a sort order stored as a Stream.

      SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
      FROM table_4
      GROUP BY bigint_col_1, bigint_col_9, bigint_col_9

      When the sort order is stored as a Stream, the line ordering.map(_.child.genCode(ctx)) in GenerateOrdering#createOrderKeys() has unpredictable side effects on ctx. This is because genCode(ctx) mutates ctx, and map on a Stream is lazy: the mutations do not all happen immediately as intended, but instead occur later, whenever the elements of the returned Stream are first accessed.
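
      The laziness described above can be reproduced outside Spark. The following is a minimal sketch, assuming Scala 2.12 (where Stream is not yet deprecated); Ctx and genCode here are hypothetical stand-ins for Spark's CodegenContext and Expression#genCode, not the real classes. It counts how many mutations of ctx have happened by the time map returns.

```scala
import scala.collection.mutable.ArrayBuffer

object StreamSideEffectDemo {
  // Hypothetical stand-in for Spark's CodegenContext: it only records mutations.
  final class Ctx { val log: ArrayBuffer[String] = ArrayBuffer.empty[String] }

  // Hypothetical stand-in for Expression#genCode: mutates ctx and returns a value.
  def genCode(name: String, ctx: Ctx): String = {
    ctx.log += name
    s"code_$name"
  }

  // Maps over `ordering` and reports how many mutations of ctx have
  // already happened at the moment map returns.
  def mutationsAfterMap(ordering: Seq[String]): Int = {
    val ctx = new Ctx
    val generated = ordering.map(genCode(_, ctx))
    ctx.log.size
  }

  def main(args: Array[String]): Unit = {
    // Strict collection: every element is visited before map returns.
    println(mutationsAfterMap(List("a", "b", "c")))
    // Lazy collection: the tail has not been visited yet when map returns.
    println(mutationsAfterMap(Stream("a", "b", "c")))
  }
}
```

      With a List, all three mutations are recorded before map returns. With a Stream, fewer are recorded at that point, and the rest would only run if the mapped Stream were traversed later, by which time ctx may be in a different state.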

      Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.

      The fix is to check if ordering is a Stream and force the modifications to happen immediately if so.
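
      The idea behind the fix can be sketched as follows. This is an illustration under stated assumptions (Scala 2.12; Ctx, genCode, and materialize are hypothetical names), not the actual patch applied to GenerateOrdering. A Stream input is copied into a strict List before mapping, so every mutation of ctx happens before map returns.

```scala
import scala.collection.mutable.ArrayBuffer

object ForceStreamDemo {
  // Hypothetical stand-in for Spark's CodegenContext: it only records mutations.
  final class Ctx { val log: ArrayBuffer[String] = ArrayBuffer.empty[String] }

  // Hypothetical stand-in for Expression#genCode: mutates ctx and returns a value.
  def genCode(name: String, ctx: Ctx): String = {
    ctx.log += name
    s"code_$name"
  }

  // Hypothetical helper: if the input is a lazy Stream, copy it into a strict
  // List; any other Seq is returned unchanged.
  def materialize[A](xs: Seq[A]): Seq[A] = xs match {
    case _: Stream[_] => xs.toList
    case _            => xs
  }

  // Maps over the materialized ordering and reports how many mutations of
  // ctx have already happened at the moment map returns.
  def mutationsAfterMap(ordering: Seq[String]): Int = {
    val ctx = new Ctx
    val generated = materialize(ordering).map(genCode(_, ctx))
    ctx.log.size
  }
}
```

      In this sketch, mutationsAfterMap reports the same count for a Stream as for a List, because the Stream is made strict before any side-effecting function is mapped over it.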

              People

              • Assignee: Ankur Dave (ankurd)
              • Reporter: Ankur Dave (ankurd)