  1. Spark
  2. SPARK-33260

SortExec produces incorrect results if sortOrder is a Stream


    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.0.1, 3.1.0
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels:


      The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a SortExec node, and (2) it contains a duplicate grouping key, causing RemoveRepetitionFromGroupExpressions to produce a sort order stored as a Stream.

      SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
      FROM table_4
      GROUP BY bigint_col_1, bigint_col_9, bigint_col_9

      When the sort order is stored as a Stream, the line ordering.map(_.child.genCode(ctx)) in GenerateOrdering#createOrderKeys() has unpredictable side effects on ctx, because genCode(ctx) mutates ctx. On a Stream, map is lazy, so the mutations do not happen immediately as intended; they occur later, whenever the returned Stream is traversed.
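
      The laziness described above can be reproduced outside Spark. The following is a minimal sketch, not Spark code: Ctx and genCode are hypothetical stand-ins for the codegen context and for genCode(ctx), and the ordering is a plain Seq[String]. It counts how many side effects are visible right after map returns and how many after the result has been fully traversed.

```scala
import scala.collection.mutable.ArrayBuffer

object StreamLazinessDemo {
  // Hypothetical stand-in for the codegen context: records every key
  // for which "code" has been generated.
  final class Ctx { val generated: ArrayBuffer[String] = ArrayBuffer.empty[String] }

  // Hypothetical stand-in for genCode(ctx): returns a value AND mutates ctx.
  def genCode(expr: String, ctx: Ctx): String = {
    ctx.generated += expr
    s"code($expr)"
  }

  // Returns (side effects visible right after map,
  //          side effects visible after traversing the result).
  def countSideEffects(ordering: Seq[String]): (Int, Int) = {
    val ctx = new Ctx
    val codes = ordering.map(genCode(_, ctx))
    val afterMap = ctx.generated.size
    codes.foreach(_ => ()) // traversal forces any deferred elements
    (afterMap, ctx.generated.size)
  }

  def main(args: Array[String]): Unit = {
    // Strict collection: every side effect has run by the time map returns.
    println(countSideEffects(List("a", "b", "c")))
    // Stream: the tail is mapped lazily, so some side effects are still pending.
    println(countSideEffects(Stream("a", "b", "c")))
  }
}
```

      Note that countSideEffects takes a Seq, just as createOrderKeys does, so nothing in the signature reveals whether the caller passed a strict or a lazy collection.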

      Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.

      The fix is to check if ordering is a Stream and force the modifications to happen immediately if so.
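
      One way to express such a check is sketched below. This illustrates the idea only and is not the actual patch: forceIfStream and this simplified createOrderKeys are hypothetical, and an ArrayBuffer stands in for the mutable ctx.

```scala
import scala.collection.mutable.ArrayBuffer

object ForceStreamSketch {
  // Materialize every element if xs is a Stream; leave other Seqs untouched.
  def forceIfStream[A](xs: Seq[A]): Seq[A] = xs match {
    case s: Stream[A] => s.force
    case other        => other
  }

  // Hypothetical stand-in for ordering.map(_.child.genCode(ctx)):
  // each mapped key is logged into ctxLog as its side effect.
  def createOrderKeys(ordering: Seq[String], ctxLog: ArrayBuffer[String]): Seq[String] =
    forceIfStream(ordering.map { key =>
      ctxLog += key
      s"code($key)"
    })
}
```

      Because the result is forced before it is returned, all side effects have been applied by the time the caller sees the keys, regardless of which Seq implementation was passed in.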

