Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-11275

[SQL] Incorrect results when using rollup/cube

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.3.0, 1.4.0, 1.5.1
    • 1.6.0
    • SQL
    • None

    Description

      Spark SQL is unable to generate a correct result when the following query using rollup.
      "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, b with rollup"

      Spark SQL generates a wrong result:
      [2,4,6,3]
      [2,null,null,1]
      [1,null,null,1]
      [null,null,null,0]
      [1,2,3,3]

      The table mytable is super simple, containing two rows and two columns:
      testData = Seq((1, 2), (2, 4)).toDF("a", "b")

      After turning off codegen, the query plan is like

      == Parsed Logical Plan ==
      'Rollup ['a,'b], unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS sumAB#20),unresolvedalias('GROUPING__ID)
      'UnresolvedRelation `mytable`, None

      == Analyzed Logical Plan ==
      a: int, b: int, sumAB: bigint, GROUPING__ID: int
      Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
      Expand [0,1,3], a#2,b#3, grouping__id#23
      Subquery mytable
      Project _1#0 AS a#2,_2#1 AS b#3
      LocalRelation _1#0,_2#1, [[1,2],[2,4]]

      == Optimized Logical Plan ==
      Aggregate a#2,b#3,grouping__id#23, a#2,b#3,sum(cast((a#2 + b#3) as bigint)) AS sumAB#20L,GROUPING__ID#23
      Expand [0,1,3], a#2,b#3, grouping__id#23
      LocalRelation a#2,b#3, [[1,2],[2,4]]

      == Physical Plan ==
      Aggregate false, a#2,b#3,grouping__id#23, a#2,b#3,sum(PartialSum#24L) AS sumAB#20L,grouping__id#23
      Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
      Aggregate true, a#2,b#3,grouping__id#23, a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L
      Expand List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3), a#2,b#3,grouping__id#23
      LocalTableScan a#2,b#3, [[1,2],[2,4]]

      Below are my observations:

      1. Generation of GROUP__ID looks OK.
      2. The problem still exists no matter whether turning on/off CODEGEN
      3. Rollup still works in a simple query when group-by columns have only one column. For example, "select b, sum(a), GROUPING__ID from mytable group by b with rollup"
      4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading. Unfortunately, they hide the bugs. Although the buckets passed, they just compare the results of SQL and Dataframe. This way is unable to capture the regression when both return the same wrong results.
      5. The same problem also exists in cube. I have not started the investigation in cube, but I believe the root causes should be the same.
      6. It looks like all the logical plans are correct.

      Attachments

        Activity

          People

            a1ray Andrew Ray
            smilegator Xiao Li
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: