Details
- Type: Question
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.1.1
- Fix Version/s: None
- Spark Version: 2.1.1
- Java Version: Java 7
- Scala Version: 2.11.8
Description
I ran a query with two inner joins and two left outer joins across five tables.
Running the same query multiple times returns different results.
Table A
Column a | Column b | Column c | Column d |
---|---|---|---|
Long(nullable: false) | Integer(nullable: false) | String(nullable: true) | String(nullable: false) |
Table B
Column a | Column b |
---|---|
Long(nullable: false) | String(nullable: false) |
Table C
Column a | Column b |
---|---|
Integer(nullable: false) | String(nullable: false) |
Table D
Column a | Column b | Column c |
---|---|---|
Long(nullable: true) | Long(nullable: false) | Integer(nullable: false) |
Table E
Column a | Column b | Column c |
---|---|---|
Long(nullable: false) | Integer(nullable: false) | String |
Query(Spark SQL)
select A.c, B.b, C.b, D.c, E.c
from A
inner join B on A.a = B.a
inner join C on A.b = C.a
left outer join D on A.d <=> cast(D.a as string)
left outer join E on D.b = E.a and D.c = E.b
I ran the above query 10 times: it returned the correct result 7 times (count: 830001460) and an incorrect result 3 times (count: 830001299).
In addition, I ran
sql("set spark.sql.shuffle.partitions=801")
before executing the query.
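The repeated-run check described above can be sketched as a small helper that runs a count-producing action several times and tallies how often each distinct result appears. This is only an illustration: `RepeatCheck`, `tally`, and `isDeterministic` are hypothetical names that do not appear in the original report, and the commented usage assumes a spark-shell session where `spark` is available and `query` holds the SQL text shown above.

```scala
// Hypothetical helper (not part of Spark): tallies the results of repeated runs.
object RepeatCheck {

  /** Runs `action` `runs` times and returns how often each distinct result occurred. */
  def tally(runs: Int)(action: () => Long): Map[Long, Int] =
    (1 to runs)
      .map(_ => action())                 // execute the action once per run
      .groupBy(identity)                  // group identical results together
      .map { case (result, hits) => result -> hits.size }

  /** True when every run produced the same result (or there were no runs). */
  def isDeterministic(counts: Map[Long, Int]): Boolean = counts.size <= 1
}

// Assumed usage in spark-shell (`spark` and `query` are assumptions, see above):
//   spark.sql("set spark.sql.shuffle.partitions=801")
//   val counts = RepeatCheck.tally(10)(() => spark.sql(query).count())
//   println(counts)
//   println(RepeatCheck.isDeterministic(counts))
```

With a helper like this, a query that behaves as reported would yield a tally with two entries rather than one, which makes the nondeterminism easy to spot and to compare across configuration settings.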
Tables A and B have many rows, while table C is a small dataset, so in the physical plan the A-B join is performed with SortMergeJoin and the (A, B)-C join with a broadcast hash join.
When I removed the set spark.sql.shuffle.partitions statement, the query worked correctly.
Is this a bug in Spark SQL?
Issue Links
- duplicates
  - SPARK-23207 Shuffle+Repartition on an DataFrame could lead to incorrect answers (Resolved)
  - SPARK-23243 Shuffle+Repartition on an RDD could lead to incorrect answers (Resolved)