[SPARK-48503] Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results.

create table x(x1 int, x2 int);
insert into x values (1, 1);
create table y(y1 int, y2 int);
insert into y values (2, 2), (3, 3);

select *, (select count(*) from y where y1 > x1 group by y1) from x;

It returns two rows, even though there's only one row of x.

The correct result is an error: more than one row returned by a subquery used as an expression (as seen in postgres for example)

This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, it looks like this bug has existed since non-equality correlation was added (~2 years ago).

Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`.

Another bugged case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION.

Attachments

Issue Links

links to

GitHub Pull Request #46839

GitHub Pull Request #47388

Activity

People

Assignee:: Jack Chen

Reporter:: Jack Chen

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 02/Jun/24 23:44

Updated:: 26/Jul/24 01:37

Resolved:: 03/Jun/24 17:51