Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17863

SELECT distinct does not work if there is a order by clause

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: SQL
    • Labels:

      Description

      select distinct struct.a, struct.b
      from (
        select named_struct('a', 1, 'b', 2, 'c', 3) as struct
        union all
        select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
      order by struct.a, struct.b
      

      This query generates

      +---+---+
      |  a|  b|
      +---+---+
      |  1|  2|
      |  1|  2|
      +---+---+
      

      The plan is wrong because the analyze somehow added struct#21805 to the project list, which changes the semantic of the distinct (basically, the query is changed to select distinct struct.a, struct.b, struct from select distinct struct.a, struct.b).

      == Parsed Logical Plan ==
      'Sort ['struct.a ASC, 'struct.b ASC], true
      +- 'Distinct
         +- 'Project ['struct.a, 'struct.b]
            +- 'SubqueryAlias tmp
               +- 'Union
                  :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
                  :  +- OneRowRelation$
                  +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
                     +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      a: int, b: int
      Project [a#21819, b#21820]
      +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
         +- Distinct
            +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805]
               +- SubqueryAlias tmp
                  +- Union
                     :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
                     :  +- OneRowRelation$
                     +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
                        +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [a#21819, b#21820]
      +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
         +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805]
            +- Union
               :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
               :  +- OneRowRelation$
               +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
                  +- OneRowRelation$
      
      == Physical Plan ==
      *Project [a#21819, b#21820]
      +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
         +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
            +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
               +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
                  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
                     +- Union
                        :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
                        :  +- Scan OneRowRelation[]
                        +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
                           +- Scan OneRowRelation[]
      

      If you use the following query, you will get the correct result

      select distinct struct.a, struct.b
      from (
        select named_struct('a', 1, 'b', 2, 'c', 3) as struct
        union all
        select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
      order by a, b
      

        Attachments

          Activity

            People

            • Assignee:
              davies Davies Liu
              Reporter:
              yhuai Yin Huai
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: