Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18455 General support for correlated subquery processing
  3. SPARK-18582

Whitelist LogicalPlan operators allowed in correlated subqueries

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.0
    • 2.1.0
    • SQL
    • None

    Description

      We want to tighten the code that handles correlated subquery to whitelist operators that are allowed in it.

      The current code in def pullOutCorrelatedPredicates looks like

            // Simplify the predicates before pulling them out.
            val transformed = BooleanSimplification(sub) transformUp {
              case f @ Filter(cond, child) => ...
              case p @ Project(expressions, child) => ...
              case a @ Aggregate(grouping, expressions, child) => ...
              case w : Window => ...
              case j @ Join(left, _, RightOuter, _) => ...
              case j @ Join(left, right, FullOuter, _) => ...
              case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
              case u: Union => ...
              case s: SetOperation => ...
              case e: Expand => ...
              case l : LocalLimit => ...
              case g : GlobalLimit => ...
              case s : Sample => ...
              case p =>
                failOnOuterReference(p)
                ...
            }
      

      The code disallows operators in a sub plan of an operator hosting correlation on a case by case basis. As it is today, it only blocks Union, Intersect, Except, Expand LocalLimit GlobalLimit Sample FullOuter and right table of LeftOuter (and left table of RightOuter). That means any LogicalPlan operators that are not in the list above are permitted to be under a correlation point. Is this risky? There are many (30+ at least from browsing the LogicalPlan type hierarchy) operators derived from LogicalPlan class.

      For the case of ScalarSubquery, it explicitly checks that only SubqueryAlias Project Filter Aggregate are allowed (CheckAnalysis.scala around line 126-165 in and after def cleanQuery). We should whitelist which operators are allowed in correlated subqueries. At my first glance, we should allow, in addition to the ones allowed in ScalarSubquery: Join, Distinct, Sort.

      Attachments

        Activity

          People

            nsyca Nattavut Sutyanyong
            nsyca Nattavut Sutyanyong
            Herman van Hövell Herman van Hövell
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: