Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24254

Eagerly evaluate some subqueries over LocalRelation

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      Some queries would benefit from evaluating subqueries over LocalRelations eagerly. For example:

      SELECT t1.part_col FROM t1 JOIN (SELECT max(part_col) m FROM t2) foo WHERE t1.part_col = foo.m
      

      If max(part_col) could be evaluated during planning, there's an opportunity to prune all but at most one partitions from the scan of t1.

      Similarly, a near-identical query with a non-scalar subquery in the WHERE clause:

      SELECT * FROM t1 WHERE part_col IN (SELECT part_col FROM t2)
      

      could be partially evaluated to eliminate some partitions, and remove the join from the plan.

      Obviously all subqueries over local relations can't be evaluated during planning, but certain whitelisted aggregates could be if the input cardinality isn't too high.

      Attachments

        Activity

          People

            Unassigned Unassigned
            henryr Henry Robinson
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: