Description
Since SPARK-28379, Spark has supported non-aggregated single-row correlated subqueries. SPARK-40800 handles the majority of cases, where the inner Projects can be collapsed, but Spark can still throw exceptions for single-row subqueries that contain non-deterministic expressions. For example:
CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a;

SELECT (
  SELECT array_sort(a, (i, j) -> rank[i] - rank[j])[0] + r + r AS sorted
  FROM (SELECT MAP('a', 1, 'b', 2) rank, rand() AS r)
) FROM t1;
This throws an exception:
Unexpected operator Join Inner
:- Aggregate [[a,b]], [[a,b] AS a#253]
: +- OneRowRelation
+- Project [map(keys: [a,b], values: [1,2]) AS rank#241, rand(86882494013664043) AS r#242]
+- OneRowRelation
in correlated subquery
This is because, when Spark rewrites correlated scalar subqueries, it checks whether the subquery is subject to the COUNT bug by splitting the subquery plan into the parts above the aggregate, the aggregate itself, and the parts below the aggregate (see `splitSubquery` in the `RewriteCorrelatedScalarSubquery` rule).
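The restriction comes from the pattern match in that traversal: it only tolerates a fixed set of operators above the aggregate and errors out on everything else. Below is a simplified Scala sketch of that shape, written to match the behavior described above rather than the actual Spark source; the `SplitSubquerySketch` wrapper is hypothetical, added only to make the example self-contained.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, LogicalPlan, Project, SubqueryAlias}

object SplitSubquerySketch {
  // Simplified sketch of the splitSubquery traversal (illustrative only).
  // It walks down the subquery plan, collecting the operators above the
  // aggregate, and fails on any operator it does not recognize.
  def splitSubquery(plan: LogicalPlan): (Seq[LogicalPlan], Option[Filter], Aggregate) = {
    val topPart = ArrayBuffer.empty[LogicalPlan]
    var bottomPart: LogicalPlan = plan
    while (true) {
      bottomPart match {
        case havingPart @ Filter(_, aggPart: Aggregate) =>
          // HAVING clause sitting on top of the aggregate.
          return (topPart.toSeq, Some(havingPart), aggPart)
        case aggPart: Aggregate =>
          // Plain aggregate, no HAVING.
          return (topPart.toSeq, None, aggPart)
        case p @ Project(_, child) =>
          // Remember the Project and keep walking down.
          topPart += p
          bottomPart = child
        case s @ SubqueryAlias(_, child) =>
          topPart += s
          bottomPart = child
        case op =>
          // Anything else, such as the Join in the plan above, lands here.
          sys.error(s"Unexpected operator $op in correlated subquery")
      }
    }
    sys.error("This line should be unreachable")
  }
}

In the failing query, the non-deterministic Project is presumably not collapsed, so the rewritten subquery contains a Join that falls into the catch-all branch, producing the "Unexpected operator Join Inner ... in correlated subquery" error shown above.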
This pattern is very restrictive and does not work well with non-aggregated single-row subqueries. We should fix this issue.