Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-35553 Improve correlated subqueries
  3. SPARK-48503

Scalar subquery with group-by and non-equality predicate incorrectly allowed, wrong results

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 4.0.0
    • 4.0.0
    • SQL

    Description

      This query is not legal and should give an error, but instead we incorrectly allow it and it returns wrong results.

      create table x(x1 int, x2 int);
      insert into x values (1, 1);
      create table y(y1 int, y2 int);
      insert into y values (2, 2), (3, 3);
      
      select *, (select count(*) from y where y1 > x1 group by y1) from x; 

      It returns two rows, even though there's only one row of x.

      The correct result is an error: more than one row returned by a subquery used as an expression (as seen in postgres for example)

       

      This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities -  because when that code was written, non-equality correlation wasn’t allowed. Therefore, it looks like this bug has existed since non-equality correlation was added (~2 years ago).

       

      Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`.

      Another bugged case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION.

      Attachments

        Activity

          People

            jchen5 Jack Chen
            jchen5 Jack Chen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: