Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17099

Incorrect result when HAVING clause is added to group by query

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.1.0
    • 2.0.1, 2.1.0
    • SQL

    Description

      Random query generation uncovered the following query which returns incorrect results when run on Spark SQL. This wasn't the original query uncovered by the generator, since I performed a bit of minimization to try to make it more understandable.

      With the following tables:

      val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
      val t2 = sc.parallelize(
        Seq(
          (-769, -244),
          (-800, -409),
          (940, 86),
          (-507, 304),
          (-367, 158))
      ).toDF("int_col_2", "int_col_5")
      
      t1.registerTempTable("t1")
      t2.registerTempTable("t2")
      

      Run

      SELECT
        (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
           ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
      FROM t1
      RIGHT JOIN t2
        ON (t2.int_col_2) = (t1.int_col_5)
      GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
               COALESCE(t1.int_col_5, t2.int_col_2)
      HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
      

      In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the HAVING clause I see that the group's rows are being incorrectly filtered by the HAVING clause:

      +--------------------------------------+---------------------------------------+--+
      | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2)  |
      +--------------------------------------+---------------------------------------+--+
      | -507                                 | -1014                                 |
      | 940                                  | 1880                                  |
      | -769                                 | -1538                                 |
      | -367                                 | -734                                  |
      | -800                                 | -1600                                 |
      +--------------------------------------+---------------------------------------+--+
      

      Based on this, the output after adding the HAVING should contain four rows, not zero.

      I'm not sure how to further shrink this in a straightforward way, so I'm opening this bug to get help in triaging further.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            smilegator Xiao Li Assign to me
            joshrosen Josh Rosen
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment