Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
4.0.0
Description
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.)
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS
It's easy to repro with a query like
create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2)
Correct result: empty set, see postgres: https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0
Spark result: Array([2,2])
The PR where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084)
I'll work on both:
- Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit).
- Adding a feature flag to enable/disable offset in subquery support