Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-43838

Subquery on single table with having clause can't be optimized

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 4.0.0
    • SQL

    Description

      Eg:

      sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")
      
      sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
      "having cnt = 0) from t t1").show() 

      The error will throw:

      [PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in batch Operator Optimization before Inferring Filters generated an invalid plan: The plan becomes unresolved: 'Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(cnt#246L, Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
      +- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT (cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
         +- 'Join LeftOuter, (c1#224 = c1#224#244)
            :- Project [col1#226 AS c1#224, col2#227 AS c2#225]
            :  +- LocalRelation [col1#226, col2#227]
            +- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS alwaysTrue#245]
               +- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
                  +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
                     +- Project [col1#228 AS c1#224]
                        +- LocalRelation [col1#228, col2#229]The previous plan: Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 = c1#224#244)], Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
      :  +- Project [cnt#222L, c1#224 AS c1#224#244]
      :     +- Filter (cnt#222L = 0)
      :        +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
      :           +- Project [col1#228 AS c1#224]
      :              +- LocalRelation [col1#228, col2#229]
      +- Project [col1#226 AS c1#224, col2#227 AS c2#225]
         +- LocalRelation [col1#226, col2#227] 

       

      The reason are when execute subquery decorrelation, the fields in the subquery but not in having clause are wrongly pull up. This problem only occurs when there contain having clause.

       

      Attachments

        Issue Links

          Activity

            People

              fanjia Jia Fan
              fanjia Jia Fan
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: