Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-43413

IN subquery ListQuery has wrong nullability

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.5.0
    • SQL
    • None

    Description

      IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations.

      Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE.

      This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless.

      This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed.

      Attachments

        Activity

          People

            jchen5 Jack Chen
            jchen5 Jack Chen
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: