[SPARK-18455] General support for correlated subquery processing - ASF JIRA

Details

Type: Story
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Subquery support was introduced in Spark 2.0. The initial implementation covers the most common subquery use cases, for instance the ones used in TPC queries.

Spark currently supports the following subqueries:

Uncorrelated Scalar Subqueries. All cases are supported.
Correlated Scalar Subqueries. We only allow subqueries that are aggregated and use equality predicates.
Predicate Subqueries. IN or EXISTS queries. We allow most predicates, except when they are pulled from under an Aggregate or Window operator. In that case we only support equality predicates.

However, this does not cover the full range of possible subqueries. This is partly because we currently rewrite all correlated subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
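
The join rewrite described above can be illustrated outside Spark. The following is a minimal sketch in plain Python, not Spark's implementation: it evaluates a correlated [NOT] EXISTS predicate row by row, then computes the same result as a left semi join and a left anti join on the correlated equality predicate. The table contents and the helper names (`exists_naive`, `left_semi_join`, `left_anti_join`) are hypothetical and chosen only for this example.

```python
# Toy rows standing in for two tables; the data is made up for illustration.
orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "b"}, {"id": 3, "cust": "c"}]
returns = [{"order_id": 1}, {"order_id": 1}, {"order_id": 3}]


def exists_naive(outer, inner, negate=False):
    """Evaluate [NOT] EXISTS (SELECT 1 FROM inner WHERE inner.order_id = outer.id)
    by re-running the subquery for every outer row."""
    result = []
    for o in outer:
        found = any(i["order_id"] == o["id"] for i in inner)
        if found != negate:
            result.append(o)
    return result


def left_semi_join(outer, inner):
    """Keep each outer row (at most once) when it has a match on the join key."""
    keys = {i["order_id"] for i in inner}
    return [o for o in outer if o["id"] in keys]


def left_anti_join(outer, inner):
    """Keep each outer row that has no match on the join key."""
    keys = {i["order_id"] for i in inner}
    return [o for o in outer if o["id"] not in keys]


semi = left_semi_join(orders, returns)
anti = left_anti_join(orders, returns)
print([o["id"] for o in semi])  # [1, 3]
print([o["id"] for o in anti])  # [2]
```

Order 1 appears twice in `returns` but only once in the semi join result, which is what distinguishes a semi join from an inner join and why it can stand in for EXISTS.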

We currently lack support for the following use cases:

The use of predicate subqueries in a projection.
The use of non-equality predicates below Aggregate and/or Window operators.
The use of non-Aggregate subqueries for correlated scalar subqueries.
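
To make the three query shapes above concrete, here is a hedged sketch that runs one example of each. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in engine that accepts these shapes; it says nothing about which Spark versions accept them. The tables `t1` and `t2` and their contents are hypothetical.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (k INTEGER, v INTEGER);
    CREATE TABLE t2 (k INTEGER, v INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO t2 VALUES (1, 100), (3, 300);
""")

# 1. Predicate subquery in a projection: EXISTS appears in the SELECT list
#    rather than in a WHERE clause.
in_projection = conn.execute("""
    SELECT t1.k, EXISTS (SELECT 1 FROM t2 WHERE t2.k = t1.k) AS has_match
    FROM t1 ORDER BY t1.k
""").fetchall()

# 2. Non-equality correlated predicate (t2.k < t1.k) below an aggregate.
non_equality = conn.execute("""
    SELECT t1.k, (SELECT MAX(t2.v) FROM t2 WHERE t2.k < t1.k) AS max_below
    FROM t1 ORDER BY t1.k
""").fetchall()

# 3. Correlated scalar subquery without an aggregate; t2.k is unique in this
#    data, so the subquery yields at most one row per outer row.
non_aggregate = conn.execute("""
    SELECT t1.k, (SELECT t2.v FROM t2 WHERE t2.k = t1.k) AS t2_v
    FROM t1 ORDER BY t1.k
""").fetchall()

print(in_projection)  # [(1, 1), (2, 0), (3, 1)]
print(non_equality)   # [(1, None), (2, 100), (3, 100)]
print(non_aggregate)  # [(1, 100), (2, None), (3, 300)]
conn.close()
```

SQLite reports the EXISTS result as the integers 1 and 0, and an empty scalar subquery as NULL (`None` in Python).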

This JIRA aims to lift these limitations in subquery processing.

Attachments

SPARK-18455-scoping-doc.pdf	15/Dec/16 02:15	104 kB	Nattavut Sutyanyong

Issue Links

is related to

SPARK-18966	NOT IN subquery with correlated expressions may return incorrect result	Resolved
SPARK-15370	Some correlated subqueries return incorrect answers	Resolved
SPARK-15832	Embedded IN/EXISTS predicate subquery throws TreeNodeException	Resolved
SPARK-16804	Correlated subqueries containing non-deterministic operators return incorrect results	Resolved
SPARK-17348	Incorrect results from subquery transformation	Resolved
SPARK-18504	Scalar subquery with extra group by columns returning incorrect result	Resolved
SPARK-18578	Full outer join in correlated subquery returns incorrect results	Resolved
SPARK-16161	Ambiguous error message for unsupported correlated predicate subqueries	Resolved

relates to

SPARK-23945	Column.isin() should accept a single-column DataFrame as input	Resolved

Sub-Tasks

1.	EXISTS and Left Semi join do not produce the same plan	Resolved	Dilip Biswal
2.	Support `OuterReference` in projection list of IN correlated subqueries	Resolved	Unassigned
3.	NOT IN subquery with correlated expressions may return incorrect result	Resolved	Nattavut Sutyanyong
4.	First phase: Deferring the correlated predicate pull up to Optimizer phase	Resolved	Dilip Biswal
5.	New test cases for IN/NOT IN subquery	Resolved	kevin yu
6.	New test cases for scalar subquery	Resolved	Nattavut Sutyanyong
7.	New test cases for EXISTS subquery	Resolved	Dilip Biswal
8.	Alternative implementation of NOT IN to Anti-join	Closed	Unassigned
9.	Whitelist LogicalPlan operators allowed in correlated subqueries	Resolved	Nattavut Sutyanyong
10.	Caching logical plans containing subquery expressions does not work.	Resolved	Dilip Biswal
11.	Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references	Resolved	Dilip Biswal
12.	Move error reporting for subquery from Analyzer to CheckAnalysis	Resolved	Dilip Biswal

People

Assignee: Dilip Biswal

Reporter: Nattavut Sutyanyong

Shepherd: Herman van Hövell

Votes: 8

Watchers: 17

Dates

Created: 15/Nov/16 21:45

Updated: 19/Jan/20 07:23

Resolved: 09/Sep/19 16:29