Spark / SPARK-25156

Same query returns different result



    • Type: Question
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment:
      • Spark Version: 2.1.1
      • Java Version: Java 7
      • Scala Version: 2.11.8


      I ran a query with two inner joins and two left outer joins across five tables.

      Running the same query multiple times returns different results.

      Table A

        Column a: Long (nullable: false)
        Column b: Integer (nullable: false)
        Column c: String (nullable: true)
        Column d: String (nullable: false)

      Table B

        Column a: Long (nullable: false)
        Column b: String (nullable: false)

      Table C

        Column a: Integer (nullable: false)
        Column b: String (nullable: false)

      Table D

        Column a: Long (nullable: true)
        Column b: Long (nullable: false)
        Column c: Integer (nullable: false)

      Table E

        Column a: Long (nullable: false)
        Column b: Integer (nullable: false)
        Column c: String

      Query (Spark SQL)

      select A.c, B.b, C.b, D.c, E.c
      from A
      inner join B on A.a = B.a
      inner join C on A.b = C.a
      left outer join D on A.d <=> cast(D.a as string)
      left outer join E on D.b = E.a and D.c = E.b


      I ran the above query 10 times: 7 runs returned the correct result (count: 830001460) and 3 runs returned an incorrect result (count: 830001299).
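One way to quantify this kind of run-to-run variation is to repeat the count and tally the distinct values. The following is a minimal Scala sketch, not code from the report: `CountTally` and `tallyCounts` are hypothetical names, and `spark` and `query` in the usage comment are assumed to be an existing SparkSession and the SQL text above.

```scala
object CountTally {
  // Hypothetical helper (not part of Spark): runs a counting action `runs` times
  // and returns how often each distinct count was observed.
  def tallyCounts(runs: Int)(countOnce: () => Long): Map[Long, Int] =
    (1 to runs)
      .map(_ => countOnce())
      .groupBy(identity)
      .map { case (count, hits) => count -> hits.size }
}

// Assumed usage in spark-shell, where `spark` is the SparkSession and
// `query` holds the SQL text:
//   CountTally.tallyCounts(10)(() => spark.sql(query).count())
```

A deterministic query should produce a map with a single key; for the ten runs described above, the tally would contain two keys.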


      Additionally, I run

      sql("set spark.sql.shuffle.partitions=801")

      before executing the query.

      Tables A and B have many rows, but table C is small, so when I looked at the physical plan, the join of A and B was performed as a SortMergeJoin and the join of (A, B) with C as a broadcast hash join.


      After I removed the set spark.sql.shuffle.partitions statement, the query works fine.

      Is this a bug in Spark SQL?


              Assignee: Unassigned
              Reporter: Yonghwan Lee (leeyh0216)