[SPARK-17863] SELECT distinct does not work if there is a order by clause - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.2, 2.1.0
Component/s: SQL
Labels:
- correctness

Target Version/s:

2.0.2, 2.1.0

Description

select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b

This query generates

+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+

The plan is wrong because the analyze somehow added struct#21805 to the project list, which changes the semantic of the distinct (basically, the query is changed to select distinct struct.a, struct.b, struct from select distinct struct.a, struct.b).

== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
      +- 'SubqueryAlias tmp
         +- 'Union
            :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
            :  +- OneRowRelation$
            +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
               +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
      +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805]
         +- SubqueryAlias tmp
            +- Union
               :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
               :  +- OneRowRelation$
               +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
                  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805]
      +- Union
         :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
         :  +- OneRowRelation$
         +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
            +- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
      +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
         +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
            +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
               +- Union
                  :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
                  :  +- Scan OneRowRelation[]
                  +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
                     +- Scan OneRowRelation[]

If you use the following query, you will get the correct result

select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by a, b

Attachments

Issue Links

links to

[Github] Pull Request #15489 (davies)

Activity

People

Assignee:: Davies Liu

Reporter:: Yin Huai

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 11/Oct/16 01:59

Updated:: 14/Oct/16 21:46

Resolved:: 14/Oct/16 21:46