Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46446

Correctness bug in correlated subquery with OFFSET

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.0.0
    • 4.0.0
    • SQL

    Description

      Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.)

      It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS

       

      It's easy to repro with a query like

      create table x(x1 int, x2 int);
      insert into x values (1, 1), (2, 2);
      create table y(y1 int, y2 int);
      insert into y values (1, 1), (1, 2), (2, 4);
      
      
      select * from x where exists (select * from y where x1 = y1 limit 1 offset 2)

      Correct result: empty set, see postgres: https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0 

      Spark result: Array([2,2])

       

      The PR where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084)

      I'll work on both:

      • Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit).
      • Adding a feature flag to enable/disable offset in subquery support

      Attachments

        Activity

          People

            jchen5 Jack Chen
            jchen5 Jack Chen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: