[SPARK-44550] Wrong semantics for null IN (empty list) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

The expression null IN (empty list) incorrectly evaluates to null; it should evaluate to false. The reason is that a IN (b1, b2) is defined as a = b1 OR a = b2, so an empty IN list is an empty OR, which is false. This is specified by ANSI SQL.
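The definition above can be sketched in a few lines of Python. This is only an illustrative model of SQL three-valued logic, not Spark code: the helper names `sql_eq`, `sql_or`, and `sql_in` are hypothetical, and `None` stands in for SQL null.

```python
from typing import Iterable, Optional


def sql_eq(a, b) -> Optional[bool]:
    """SQL equality: comparing with null (None) yields unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


def sql_or(x: Optional[bool], y: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: true wins, then unknown, otherwise false."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False


def sql_in(a, values: Iterable) -> Optional[bool]:
    """Model `a IN (values...)` as an OR over equality comparisons."""
    result: Optional[bool] = False  # an empty OR is false
    for b in values:
        result = sql_or(result, sql_eq(a, b))
    return result


print(sql_in(None, []))      # False: no comparison is ever made
print(sql_in(None, [1, 2]))  # None: every comparison is unknown
print(sql_in(1, [2, None]))  # None: no match, but one comparison is unknown
print(sql_in(1, [1, None]))  # True: one comparison is true
```

Because the fold starts from false and the empty list contributes no comparisons, the left operand is never inspected, which is why null IN (empty list) comes out as false rather than null in this model.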

Many places in Spark execution (In, InSet, InSubquery) and optimization (OptimizeIn, NullPropagation) implemented this wrong behavior. Spark's behavior for null IN (empty list) is also inconsistent: literal IN lists generally return null (incorrect), while IN and NOT IN subqueries mostly return false and true, respectively (correct).
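As a point of comparison outside Spark, the expected results can be checked against SQLite through Python's standard `sqlite3` module. This is a sketch under stated assumptions: it shows how a different engine behaves, not how Spark evaluates these expressions, the `evaluate` helper is hypothetical, and the literal empty list `IN ()` is accepted by SQLite as an extension that other engines may reject.

```python
import sqlite3

# In-memory SQLite database, used only as a reference engine (not Spark).
conn = sqlite3.connect(":memory:")


def evaluate(expr: str):
    """Evaluate one scalar SQL expression and return its Python value."""
    return conn.execute(f"SELECT {expr}").fetchone()[0]


# Empty subquery on the right-hand side: IN is false, NOT IN is true.
empty_in = evaluate("NULL IN (SELECT 1 WHERE 0)")
empty_not_in = evaluate("NULL NOT IN (SELECT 1 WHERE 0)")

# Literal empty list (a SQLite extension): also false.
empty_literal_in = evaluate("NULL IN ()")

# Non-empty list: the result is null, which sqlite3 maps to None.
nonempty_in = evaluate("NULL IN (1, 2)")

print(empty_in, empty_not_in, empty_literal_in, nonempty_in)  # 0 1 0 None
```

SQLite reports booleans as the integers 0 and 1, so 0 corresponds to false and 1 to true here.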

This is a longstanding correctness issue that has existed since null support for IN expressions was first added to Spark.

Doc with more details: https://docs.google.com/document/d/1k8AY8oyT-GI04SnP7eXttPDnDj-Ek-c3luF2zL6DPNU/edit

Issue Links

links to

GitHub Pull Request #43068

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee: Jack Chen

Reporter: Jack Chen

Votes: 0

Watchers: 2

Dates

Created: 26/Jul/23 03:26

Updated: 26/Sep/23 01:40

Resolved: 26/Sep/23 01:40