Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17099

Incorrect result when HAVING clause is added to group by query

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.1.0
    • 2.0.1, 2.1.0
    • SQL

    Description

      Random query generation uncovered the following query which returns incorrect results when run on Spark SQL. This wasn't the original query uncovered by the generator, since I performed a bit of minimization to try to make it more understandable.

      With the following tables:

      val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
      val t2 = sc.parallelize(
        Seq(
          (-769, -244),
          (-800, -409),
          (940, 86),
          (-507, 304),
          (-367, 158))
      ).toDF("int_col_2", "int_col_5")
      
      t1.registerTempTable("t1")
      t2.registerTempTable("t2")
      

      Run

      SELECT
        (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
           ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
      FROM t1
      RIGHT JOIN t2
        ON (t2.int_col_2) = (t1.int_col_5)
      GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
               COALESCE(t1.int_col_5, t2.int_col_2)
      HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
      

      In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the HAVING clause I see that the group's rows are being incorrectly filtered by the HAVING clause:

      +--------------------------------------+---------------------------------------+--+
      | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
      +--------------------------------------+---------------------------------------+--+
      | -507                                 | -1014                                 |
      | 940                                  | 1880                                  |
      | -769                                 | -1538                                 |
      | -367                                 | -734                                  |
      | -800                                 | -1600                                 |
      +--------------------------------------+---------------------------------------+--+
      

      Based on this, the output after adding the HAVING should contain four rows, not zero.

      I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.

      Attachments

        Issue Links

          Activity

            People

              smilegator Xiao Li
              joshrosen Josh Rosen
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: