Details
Description
Hello, I encountered a filtering bug with 'isin' in PySpark SQL on version 2.2.0, Ubuntu 16.04.
Below is an example to reproduce it:
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
test_df = pd.DataFrame(
    {"col1": [None, None, "a", "b", "c"], "col2": range(5)})
test_sdf = spark.createDataFrame(test_df)
test_sdf.show()
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   a|   2|
|   b|   3|
|   c|   4|
+----+----+
- Below shows that when filtering for col1 NOT in the list ['a'], the rows where col1 is null are missing:
test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()
Expecting:
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   b|   3|
|   c|   4|
+----+----+
Got:
+----+----+
|col1|col2|
+----+----+
|   b|   3|
|   c|   4|
+----+----+
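This matches SQL NULL semantics: `null IN ('a')` evaluates to null rather than false, `NOT null` is still null, and filter only keeps rows where the predicate is exactly true, so the null rows are silently dropped either way. A minimal pure-Python sketch of that three-valued logic (no Spark needed; the helper names are mine, not Spark API):

```python
# SQL three-valued logic: a predicate is True, False, or None (NULL).
def sql_in(value, values):
    # NULL IN (...) evaluates to NULL, not False.
    if value is None:
        return None
    return value in values

def sql_not(x):
    # NOT NULL is still NULL.
    if x is None:
        return None
    return not x

rows = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]
# filter keeps only rows where the predicate is exactly True;
# rows where it is False OR NULL are both dropped.
kept = [r for r in rows if sql_not(sql_in(r[0], ["a"])) is True]
# kept == [("b", 3), ("c", 4)] -- the null rows vanish.
```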
My workarounds:
1. A null yields null (not false) from the isin comparison, so add an OR isNull condition:
test_sdf.filter(
    (sf.col("col1").isin(["a"]) == False) | (sf.col("col1").isNull())
).show()
To get:
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   c|   4|
|   b|   3|
+----+----+
2. Use a left join and filter:
join_df = pd.DataFrame(
    # one row per value to exclude, plus a non-null marker column
    {"col1": ["a"], "isin": ["isin"]})
join_sdf = spark.createDataFrame(join_df)
test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()
To get:
+----+----+----+
|col1|col2|isin|
+----+----+----+
|null|   0|null|
|null|   1|null|
|   c|   4|null|
|   b|   3|null|
+----+----+----+
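The join workaround works because the left join leaves the marker column null for every row of test_sdf that found no match in join_sdf, and that includes the rows where col1 itself is null, since a null key never equals 'a'. A pure-Python sketch of the mechanism (illustrative only, not Spark code):

```python
left = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]
exclude = ["a"]  # the keys held in the small join table

def left_join_marker(rows, keys):
    # Emulate a left join against a one-column table carrying a
    # non-null marker: matched rows get the marker, unmatched rows
    # get None. A null key never matches, mirroring SQL join rules.
    out = []
    for col1, col2 in rows:
        matched = col1 is not None and col1 in keys
        out.append((col1, col2, "isin" if matched else None))
    return out

# Keeping only rows whose marker is null is effectively an anti-join.
result = [(c1, c2) for c1, c2, m in left_join_marker(left, exclude)
          if m is None]
# result == [(None, 0), (None, 1), ("b", 3), ("c", 4)]
```

In Spark itself this pattern is what a left anti join (`how="left_anti"` on join) expresses directly, if your version supports it.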
Thank you