Details
-
Story
-
Status: Resolved
-
Major
-
Resolution: Done
-
None
-
None
-
None
Description
Subquery support has been introduced in Spark 2.0. The initial implementation covers the most common subquery use case: the ones used in TPC queries for instance.
Spark currently supports the following subqueries:
- Uncorrelated Scalar Subqueries. All cases are supported.
- Correlated Scalar Subqueries. We only allow subqueries that are aggregated and use equality predicates.
- Predicate Subqueries. IN or Exists type of queries. We allow most predicates, except when they are pulled from under an Aggregate or Window operator. In that case we only support equality predicates.
However this does not cover the full range of possible subqueries. This, in part, has to do with the fact that we currently rewrite all correlated subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
We currently lack supports for the following use cases:
- The use of predicate subqueries in a projection.
- The use of non-equality predicates below Aggregates and or Window operators.
- The use of non-Aggregate subqueries for correlated scalar subqueries.
This JIRA aims to lift these current limitations in subquery processing.
Attachments
Attachments
Issue Links
- is related to
-
SPARK-18966 NOT IN subquery with correlated expressions may return incorrect result
- Resolved
-
SPARK-15370 Some correlated subqueries return incorrect answers
- Resolved
-
SPARK-15832 Embedded IN/EXISTS predicate subquery throws TreeNodeException
- Resolved
-
SPARK-16804 Correlated subqueries containing non-deterministic operators return incorrect results
- Resolved
-
SPARK-17348 Incorrect results from subquery transformation
- Resolved
-
SPARK-18504 Scalar subquery with extra group by columns returning incorrect result
- Resolved
-
SPARK-18578 Full outer join in correlated subquery returns incorrect results
- Resolved
-
SPARK-16161 Ambiguous error message for unsupported correlated predicate subqueries
- Resolved
- relates to
-
SPARK-23945 Column.isin() should accept a single-column DataFrame as input
- Resolved