Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-43838

Subquery on single table with having clause can't be optimized

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 4.0.0
    • SQL
    • None

    Description

      Eg:

      sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")
      
      sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
      "having cnt = 0) from t t1").show() 

      The error will throw:

      [PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in batch Operator Optimization before Inferring Filters generated an invalid plan: The plan becomes unresolved: 'Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(cnt#246L, Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
      +- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT (cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
         +- 'Join LeftOuter, (c1#224 = c1#224#244)
            :- Project [col1#226 AS c1#224, col2#227 AS c2#225]
            :  +- LocalRelation [col1#226, col2#227]
            +- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS alwaysTrue#245]
               +- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
                  +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
                     +- Project [col1#228 AS c1#224]
                        +- LocalRelation [col1#228, col2#229]The previous plan: Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 = c1#224#244)], Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
      :  +- Project [cnt#222L, c1#224 AS c1#224#244]
      :     +- Filter (cnt#222L = 0)
      :        +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
      :           +- Project [col1#228 AS c1#224]
      :              +- LocalRelation [col1#228, col2#229]
      +- Project [col1#226 AS c1#224, col2#227 AS c2#225]
         +- LocalRelation [col1#226, col2#227] 

       

      The reason are when execute subquery decorrelation, the fields in the subquery but not in having clause are wrongly pull up. This problem only occurs when there contain having clause.

       

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            fanjia Jia Fan
            fanjia Jia Fan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment