Details
Description
Hello, I encountered a filtering bug with 'isin' in PySpark SQL on version 2.2.0, Ubuntu 16.04.
Below is an example to reproduce it:
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
test_df = pd.DataFrame(
    {"col1": [None, None, "a", "b", "c"], "col2": range(5)})
test_sdf = spark.createDataFrame(test_df)
test_sdf.show()
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   a|   2|
|   b|   3|
|   c|   4|
+----+----+
- Below shows that when filtering for col1 NOT in the list ['a'], the rows where col1 is null are missing:
test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()
Expecting:
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   b|   3|
|   c|   4|
+----+----+
Got:
+----+----+
|col1|col2|
+----+----+
|   b|   3|
|   c|   4|
+----+----+
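This matches SQL NULL semantics: `null IN ('a')` evaluates to null rather than false, `NOT null` is still null, and filter only keeps rows where the predicate is exactly true, so the null rows are silently dropped either way. A minimal pure-Python sketch of that three-valued logic (no Spark needed; the helper names are mine, not Spark API):

```python
# SQL three-valued logic: a predicate is True, False, or None (NULL).
def sql_in(value, values):
    # NULL IN (...) evaluates to NULL, not False.
    if value is None:
        return None
    return value in values

def sql_not(x):
    # NOT NULL is still NULL.
    if x is None:
        return None
    return not x

rows = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]
# filter keeps only rows where the predicate is exactly True;
# rows where it is False OR NULL are both dropped.
kept = [r for r in rows if sql_not(sql_in(r[0], ["a"])) is True]
# kept == [("b", 3), ("c", 4)] -- the null rows vanish.
```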
My workarounds:
1. A null yields null (not false) from the isin comparison, so add an OR isNull condition:
test_sdf.filter(
    (sf.col("col1").isin(["a"]) == False) | (sf.col("col1").isNull())
).show()
To get:
+----+----+
|col1|col2|
+----+----+
|null|   0|
|null|   1|
|   c|   4|
|   b|   3|
+----+----+
2. Use a left join and filter:
join_df = pd.DataFrame(
    # one row per value to exclude, plus a non-null marker column
    {"col1": ["a"], "isin": ["isin"]})
join_sdf = spark.createDataFrame(join_df)
test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()
To get:
+----+----+----+
|col1|col2|isin|
+----+----+----+
|null|   0|null|
|null|   1|null|
|   c|   4|null|
|   b|   3|null|
+----+----+----+
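The join workaround works because the left join leaves the marker column null for every row of test_sdf that found no match in join_sdf, and that includes the rows where col1 itself is null, since a null key never equals 'a'. A pure-Python sketch of the mechanism (illustrative only, not Spark code):

```python
left = [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)]
exclude = ["a"]  # the keys held in the small join table

def left_join_marker(rows, keys):
    # Emulate a left join against a one-column table carrying a
    # non-null marker: matched rows get the marker, unmatched rows
    # get None. A null key never matches, mirroring SQL join rules.
    out = []
    for col1, col2 in rows:
        matched = col1 is not None and col1 in keys
        out.append((col1, col2, "isin" if matched else None))
    return out

# Keeping only rows whose marker is null is effectively an anti-join.
result = [(c1, c2) for c1, c2, m in left_join_marker(left, exclude)
          if m is None]
# result == [(None, 0), (None, 1), ("b", 3), ("c", 4)]
```

In Spark itself this pattern is what a left anti join (`how="left_anti"` on join) expresses directly, if your version supports it.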
Thank you