  1. Spark
  2. SPARK-25461

PySpark Pandas UDF outputs incorrect results when input columns contain None


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None
    • Environment: I reproduced this issue by running pyspark locally on a Mac:

      Spark version: 2.3.1, pre-built with Hadoop 2.7

      Python library versions: pyarrow==0.10.0, pandas==0.20.2

    Description

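      The following PySpark script uses a simple pandas UDF to compute a boolean column from column 'A'. When column 'A' contains None, the UDF result is incorrect: in the output below, potential_bad_col is false for rows where A is 2.0, although 2.0 >= 2 should give true.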

      Script: 

       

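      import pandas as pd
      import random
      import pyspark
      from pyspark.sql.functions import col, lit, pandas_udf
      
      # Assumes `spark` is an existing SparkSession (for example, the one the pyspark shell provides).
      # 30,000 None values mixed with 6,170,000 floats, in random order.
      values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
      random.shuffle(values)
      pdf = pd.DataFrame({'A': values})
      df = spark.createDataFrame(pdf)
      
      # Pandas UDF: A >= 2, with rows where A is null left as null.
      @pandas_udf(returnType=pyspark.sql.types.BooleanType())
      def gt_2(column):
          return (column >= 2).where(column.notnull())
      
      calculated_df = (df.select(['A'])
          .withColumn('potential_bad_col', gt_2('A'))
      )
      
      # Reference column computed with built-in Column expressions instead of the UDF.
      calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
      
      calculated_df.show()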
      

       

      Output:

      +---+-----------------+-----------+
      |  A|potential_bad_col|correct_col|
      +---+-----------------+-----------+
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |1.0|            false|      false|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      |2.0|            false|       true|
      +---+-----------------+-----------+
      only showing top 20 rows
      

      This problem disappears when the number of rows is small or when the input column does not contain None.
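
      A pandas-only sketch that may help narrow this down. It applies the same expression as the body of gt_2 to two tiny Series, one with a None and one without, and prints the dtype of each result. The helper name gt_2_body and the sample values are hypothetical and chosen only for illustration. The assumption being checked is that the UDF may hand back a non-boolean Series when a batch contains None, even though the declared return type is BooleanType. This is a place to look, not a confirmed root cause.

```python
import pandas as pd


def gt_2_body(column):
    # Same expression as the body of gt_2 in the script,
    # without the pandas_udf wrapper (hypothetical helper name).
    return (column >= 2).where(column.notnull())


# Stand-ins for one batch that contains None and one that does not.
with_none = pd.Series([1.0, 2.0, None])
without_none = pd.Series([1.0, 2.0])

result_with_none = gt_2_body(with_none)
result_without_none = gt_2_body(without_none)

# Compare the dtypes pandas produces in the two cases.
print("with None:   ", result_with_none.dtype)
print("without None:", result_without_none.dtype)
```

      If the two dtypes differ, that would be consistent with the observation that the problem only appears when the input column contains None.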


            People

              Assignee: viirya (L. C. Hsieh)
              Reporter: xiangcy (Chongyuan Xiang)