Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20617

pyspark.sql filtering fails when using ~isin when there are nulls in column

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Not A Bug
    • 2.2.0
    • None
    • PySpark, SQL
    • None
    • Ubuntu Xenial 16.04, Python 3.5

    • Important

    Description

      Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04.

      Enclosed below an example to replicate:

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as sf
      import pandas as pd
      spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

      test_df = pd.DataFrame(

      {"col1": [None, None, "a", "b", "c"], "col2": range(5) }

      )

      test_sdf = spark.createDataFrame(test_df)
      test_sdf.show()

      col1 col2
      null 0
      null 1
      a 2
      b 3
      c 4
      1. Below shows when filtering col1 NOT in list ['a'] the col1 rows with null are missing:

      test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
      Or:
      test_sdf.filter(~sf.col("col1").isin(["a"])).show()

      Expecting:

      col1 col2
      null 0
      null 1
      b 3
      c 4

      Got:

      col1 col2
      b 3
      c 4

      My workarounds:

      1. null is considered 'in', so add OR isNull conditon:
      test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
      sf.col("col1").isNull())).show()

      To get:

      col1 col2 isin
      null 0 null
      null 1 null
      c 4 null
      b 3 null

      2. Use left join and filter
      join_df = pd.DataFrame(

      {"col1": ["a"], "isin": 1 }

      )

      join_sdf = spark.createDataFrame(join_df)

      test_sdf.join(join_sdf, on="col1", how="left") \
      .filter(sf.col("isin").isNull()) \
      .show()

      To get:

      col1 col2 isin
      null 0 null
      null 1 null
      c 4 null
      b 3 null

      Thank you

      Attachments

        Activity

          People

            Unassigned Unassigned
            emtl97 Ed Lee
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: