[SPARK-46446] Correctness bug in correlated subquery with OFFSET - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

Correlated subqueries that use LIMIT with OFFSET have a correctness bug. It was introduced recently, when support for correlation under OFFSET was enabled without OFFSET being handled correctly. (So we went from unsupported, where the query throws an error, to wrong results.)

It's a bug in all types of correlated subqueries: scalar, lateral, IN, and EXISTS.

It's easy to repro with a query like:

create table x(x1 int, x2 int);
insert into x values (1, 1), (2, 2);
create table y(y1 int, y2 int);
insert into y values (1, 1), (1, 2), (2, 4);


select * from x where exists (select * from y where x1 = y1 limit 1 offset 2);

Correct result: empty set (see Postgres: https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0)

Spark result: Array([2,2])
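
To make the expected result concrete, here is a minimal Python sketch of the reference semantics: the subquery is evaluated once per outer row, and LIMIT/OFFSET apply to that row's matches only. The function name and the in-memory tables are hypothetical, for illustration; this is not Spark code.

```python
from typing import List, Tuple

Row = Tuple[int, int]

# Same data as the repro above.
X: List[Row] = [(1, 1), (2, 2)]
Y: List[Row] = [(1, 1), (1, 2), (2, 4)]


def exists_with_limit_offset(
    x_rows: List[Row], y_rows: List[Row], limit: int, offset: int
) -> List[Row]:
    """Evaluate `exists (select * from y where x1 = y1 limit L offset O)` per outer row."""
    result = []
    for x1, x2 in x_rows:
        # Rows of y that correlate with this particular outer row.
        matches = [row for row in y_rows if row[0] == x1]
        # OFFSET skips the first `offset` matches, LIMIT keeps at most `limit` of the rest.
        kept = matches[offset:offset + limit]
        if kept:  # EXISTS is true only if something survives
            result.append((x1, x2))
    return result


print(exists_with_limit_offset(X, Y, limit=1, offset=2))  # prints []
```

No value of y1 has more than two matching rows, so skipping two rows always leaves nothing and no outer row qualifies.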

The PR that introduced the bug added a test for it, but the golden file results for that test were actually incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084.)

I'll work on both:

- Adding support for OFFSET in DecorrelateInnerQuery (the transformation is into a filter on the row_number window function, similar to LIMIT).
- Adding a feature flag to enable/disable support for OFFSET in subqueries.
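
The row_number idea can be sketched at the SQL level. The example below is only an illustration, assuming SQLite (3.25 or later, for window functions) as a stand-in engine via Python's sqlite3 module; it is not the DecorrelateInnerQuery implementation, and the helper names are hypothetical. The subquery is rewritten so that rows of y are numbered within each y1 partition, and LIMIT l OFFSET o becomes the filter o < rn <= o + l. The ORDER BY y2 inside the window is an added assumption to keep the numbering deterministic; the original query has no ordering.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create the repro tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE x(x1 INT, x2 INT);
        INSERT INTO x VALUES (1, 1), (2, 2);
        CREATE TABLE y(y1 INT, y2 INT);
        INSERT INTO y VALUES (1, 1), (1, 2), (2, 4);
        """
    )
    return conn


# Decorrelated form: number the rows per correlation key, then keep only
# the row numbers that LIMIT/OFFSET would have kept for each outer row.
REWRITTEN = """
    SELECT x1, x2 FROM x
    WHERE x1 IN (
        SELECT y1 FROM (
            SELECT y1, row_number() OVER (PARTITION BY y1 ORDER BY y2) AS rn
            FROM y
        ) AS ranked
        WHERE rn > :off AND rn <= :off + :lim
    )
    ORDER BY x1
"""


def rewritten_exists(conn: sqlite3.Connection, limit: int, offset: int):
    """Run the rewritten query for a given LIMIT and OFFSET."""
    return conn.execute(REWRITTEN, {"lim": limit, "off": offset}).fetchall()


conn = setup()
print(rewritten_exists(conn, limit=1, offset=2))  # prints []
```

With limit 1 and offset 2 the filter keeps only rn = 3, which no partition reaches, so the result is empty, in line with the correct result stated in the description.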

Issue Links

links to

GitHub Pull Request #44401

GitHub Pull Request #44415

People

Assignee: Jack Chen

Reporter: Jack Chen

Votes: 0

Watchers: 3

Dates

Created: 18/Dec/23 14:33

Updated: 18/Jul/24 22:49

Resolved: 19/Dec/23 05:24