Spark > SPARK-37270

Incorrect result of filter using isNull condition


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.1
    • Component/s: Spark Core, SQL

    Description

      Simple code that reproduces this issue:

      // imports added for completeness; checkpoint() additionally requires a
      // checkpoint directory, e.g. spark.sparkContext.setCheckpointDir(...)
      import org.apache.spark.sql.functions.{col, when}
      import spark.implicits._

      val frame = Seq((false, 1)).toDF("bool", "number")
      frame
        .checkpoint()
        .withColumn("conditions", when(col("bool"), "I am not null"))
        .filter(col("conditions").isNull)
        .show(false)

      Although the "conditions" column is null

      +-----+------+----------+
      |bool |number|conditions|
      +-----+------+----------+
      |false|1     |null      |
      +-----+------+----------+

      an empty result is shown.
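
      The plans below can be reproduced with extended explain output; a sketch, assuming the same frame as above:

      frame
        .checkpoint()
        .withColumn("conditions", when(col("bool"), "I am not null"))
        .filter(col("conditions").isNull)
        .explain(true)  // prints the parsed, analyzed, optimized and physical plans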

      Execution plans:

      == Parsed Logical Plan ==
      'Filter isnull('conditions)
      +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
         +- LogicalRDD [bool#124, number#125], false
      
      == Analyzed Logical Plan ==
      bool: boolean, number: int, conditions: string
      Filter isnull(conditions#252)
      +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
         +- LogicalRDD [bool#124, number#125], false
      
      == Optimized Logical Plan ==
      LocalRelation <empty>, [bool#124, number#125, conditions#252]
      
      == Physical Plan ==
      LocalTableScan <empty>, [bool#124, number#125, conditions#252]
       

      After removing the checkpoint, the proper result is returned and the execution plans are as follows:

      == Parsed Logical Plan ==
      'Filter isnull('conditions)
      +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
         +- Project [_1#119 AS bool#124, _2#120 AS number#125]
            +- LocalRelation [_1#119, _2#120]
      
      == Analyzed Logical Plan ==
      bool: boolean, number: int, conditions: string
      Filter isnull(conditions#256)
      +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
         +- Project [_1#119 AS bool#124, _2#120 AS number#125]
            +- LocalRelation [_1#119, _2#120]
      
      == Optimized Logical Plan ==
      LocalRelation [bool#124, number#125, conditions#256]
      
      == Physical Plan ==
      LocalTableScan [bool#124, number#125, conditions#256]
       

      It seems that the most important difference is LogicalRDD -> LocalRelation.
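
      One way to probe this difference (a sketch; it only inspects schema metadata, assuming the same frame as above) is to check the nullable flag that the analyzer reports for the derived column, with and without checkpoint():

      val planned = frame
        .checkpoint()
        .withColumn("conditions", when(col("bool"), "I am not null"))
      // StructField.nullable shows what the analyzer believes about each column
      planned.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))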

      The following workarounds retrieve the correct result:

      1) remove the checkpoint

      2) add an explicit .otherwise(null) to the when

      3) add checkpoint() or cache() just before the filter

      4) downgrade to Spark 3.1.2
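
      Workaround 2 can be sketched as follows (same frame as above); per this report, the explicit ELSE branch makes the filter return the expected row:

      frame
        .checkpoint()
        .withColumn("conditions",
          when(col("bool"), "I am not null").otherwise(null))  // explicit ELSE branch
        .filter(col("conditions").isNull)
        .show(false)  // now shows the row (false, 1, null)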


      People

        Assignee: Yuming Wang (yumwang)
        Reporter: Tomasz Kus (tkus)
        Votes: 0
        Watchers: 6

