SPARK-33260

SortExec produces incorrect results if sortOrder is a Stream

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: SQL

    Description

      The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a SortExec node, and (2) it contains a duplicate grouping key, causing RemoveRepetitionFromGroupExpressions to produce a sort order stored as a Stream.

      SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
      FROM table_4
      GROUP BY bigint_col_1, bigint_col_9, bigint_col_9

      When the sort order is stored as a Stream, the line ordering.map(_.child.genCode(ctx)) in GenerateOrdering#createOrderKeys() has unpredictable side effects on ctx. This is because genCode(ctx) mutates ctx, and map on a Stream is lazy: only the head is evaluated immediately. The remaining modifications do not happen at the point intended, but instead occur later, when the returned Stream is traversed.
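
      The laziness can be reproduced without Spark. The following is a minimal sketch, assuming Scala 2.12 (where Stream is not yet deprecated). FakeCtx and its genCode method are hypothetical stand-ins for CodegenContext and Expression#genCode; they only record that a side effect happened.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for CodegenContext: records every "genCode" call.
final class FakeCtx {
  val generated = ArrayBuffer.empty[String]
  def genCode(name: String): String = {
    generated += name // side effect, analogous to genCode(ctx) mutating ctx
    s"code_$name"
  }
}

object LazyStreamDemo {
  // Returns the number of recorded side effects right after map,
  // and again after the mapped result has been fully traversed.
  def sideEffectsAfterMap(ordering: Seq[String]): (Int, Int) = {
    val ctx = new FakeCtx
    val codes = ordering.map(name => ctx.genCode(name))
    val before = ctx.generated.size
    codes.foreach(_ => ()) // traversing the result triggers any deferred calls
    val after = ctx.generated.size
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "b", "c")
    println(sideEffectsAfterMap(keys.toList))   // (3,3): all calls happen inside map
    println(sideEffectsAfterMap(keys.toStream)) // (1,3): only the head is mapped eagerly
  }
}
```

      With a List, all three calls have happened by the time map returns. With a Stream, two of them are deferred until something else walks the result, which is the window in which other code can modify ctx in between.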

      Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.

      The fix is to check if ordering is a Stream and force the modifications to happen immediately if so.
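
      One way to express that check is sketched below. This is an illustration of the idea only, not the code from the actual patch; mapEagerly is a hypothetical helper name, and Scala 2.12 is assumed.

```scala
object ForceStreamSketch {
  // Hypothetical helper: map over xs and, if the result is a lazy Stream,
  // force it so that every call to f (and its side effects) happens now.
  def mapEagerly[A, B](xs: Seq[A])(f: A => B): Seq[B] = xs.map(f) match {
    case s: Stream[B] => s.force // evaluates all remaining elements
    case other        => other   // strict collections are already evaluated
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val result = mapEagerly(Stream(1, 2, 3)) { x => calls += 1; x * 2 }
    println(calls)         // 3: all side effects occurred before returning
    println(result.toList) // List(2, 4, 6)
  }
}
```

      Converting the input to a strict collection before mapping (for example with toList or toIndexedSeq) would have the same effect; forcing only when the value is a Stream avoids copying collections that are already strict.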

            People

              Assignee: Ankur Dave (ankurd)
              Reporter: Ankur Dave (ankurd)