SPARK-23463

Filter operation fails to handle blank values and evicts even rows that satisfy the filtering condition


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Filter operations were updated in Spark 2.2.0: a Cost Based Optimizer was introduced that looks at table stats to decide filter selectivity. However, since then, filter has behaved unexpectedly for blank values. The operation not only drops rows with blank values but also filters out rows that actually satisfy the filter criteria.

      Steps to reproduce

      Consider a simple dataframe with some blank values, as below (a construction sketch follows the table):

      dev   val
      ALL   0.01
      ALL   0.02
      ALL   0.004
      ALL   (blank)
      ALL   2.5
      ALL   4.5
      ALL   45
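
      A minimal sketch for reconstructing this dataframe, assuming the attached sample is read as text so that val lands as a string column and the blank row carries an empty string (this assumption is mine, not stated in the report):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Hypothetical reconstruction of the attached "sample": val arrives as a
      # string column, and the blank row is an empty string.
      data = [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"),
              ("ALL", ""), ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")]
      df = spark.createDataFrame(data, ["dev", "val"])
      df.printSchema()  # val is string, not numeric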

      Running a simple filter operation over the val column in this dataframe yields unexpected results. For example, the following query returns an empty dataframe:

      df.filter(df["val"] > 0)

      dev val
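
      To see why, it helps to print the full plan for the int comparison; any implicit cast that Spark wraps around val will show up in the analyzed logical plan (an inspection step I am suggesting, not part of the original report):

      # Print parsed/analyzed/optimized/physical plans; look for a cast on val.
      df.filter(df["val"] > 0).explain(True)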

      However, the filter operation works as expected if the 0 in the filter condition is replaced by the float 0.0:

      df.filter(df["val"] > 0.0)

      dev val
      ALL 0.01
      ALL 0.02
      ALL 0.004
      ALL 2.5
      ALL 4.5
      ALL 45
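
      If val is indeed a string column, a workaround sketch (mine, not from the report) is to cast it explicitly before comparing, so the result no longer depends on whether the literal is an int or a float:

      from pyspark.sql import functions as F

      # With an explicit cast, blank strings become NULL and are dropped by the
      # comparison, while numeric strings compare as expected.
      df.filter(F.col("val").cast("double") > 0).show()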

       

      Note that this bug exists only in Spark 2.2.0 and later; earlier versions filter as expected for both int (0) and float (0.0) values in the filter condition.

      Also, if there are no blank values, the filter operation works as expected for all versions.

      Attachments

        1. sample (0.1 kB), Manan Bakshi


      People

        Assignee: Unassigned
        Reporter: Manan Bakshi (m.bakshi11)
        Votes: 0
        Watchers: 2
