Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35553 Improve correlated subqueries
  3. SPARK-45009

Correlated EXISTS subqueries in join ON condition unsupported and fail with internal error

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 4.0.0
    • SQL

    Description

      They are not handled in decorrelation, and we also don’t have any checks to block them, so these queries have outer references in the query plan leading to internal errors:

      CREATE TEMP VIEW x(x1, x2) AS VALUES (0, 1), (1, 2);
      CREATE TEMP VIEW y(y1, y2) AS VALUES (0, 2), (0, 3);
      CREATE TEMP VIEW z(z1, z2) AS VALUES (0, 2), (0, 3);
      select * from x left join y on x1 = y1 and exists (select * from z where z1 = x1)
      
      Error occurred during query planning: 
      org.apache.spark.sql.catalyst.plans.logical.Filter cannot be cast to org.apache.spark.sql.execution.SparkPlan 

      PullupCorrelatedPredicates#apply and RewritePredicateSubquery only handle subqueries in UnaryNode, it seems to assume that they cannot occur elsewhere, like in a join ON condition.

      We will need to decide whether to block them properly in analysis (i.e. give a proper error for them), or see if we can add support for them.

      Also note, scalar subqueries in the ON condition are unsupported too but return a proper error.

      Attachments

        Activity

          People

            jchen5 Jack Chen
            jchen5 Jack Chen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: