SPARK-30187: NULL handling in PySpark-PandasUDF


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      The new pandas UDFs have been very helpful: they simplify writing UDFs and are more performant. But I cannot reliably use them, because their NULL handling differs from normal Spark.

      Here is my understanding ...

      In Spark, null/missing values are represented as follows:

      • Float: null/NaN
      • Integer: null
      • String: null
      • DateTime: null

       

      In pandas, null/missing values are represented as follows (a short demonstration follows this list):

      • Float: NaN
      • Integer: <not possible>
      • String: null
      • DateTime: NaT
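
      To make the difference concrete, here is a quick check (the column aliases are mine, and the output is what I would expect on 2.4.4): Spark keeps null and NaN distinct in a double column, while the conversion to pandas collapses both into NaN:

      >>> from pyspark.sql import functions as F
      >>> sdf = spark.createDataFrame([[1.0], [None], [float('nan')]], ['a'])
      >>> sdf.select('a', F.isnan('a').alias('isnan'), F.isnull('a').alias('isnull')).show()
      +----+-----+------+
      |   a|isnan|isnull|
      +----+-----+------+
      | 1.0|false| false|
      |null|false|  true|
      | NaN| true| false|
      +----+-----+------+

      >>> sdf.toPandas()['a']   # both the null and the NaN come back as NaN
      0    1.0
      1    NaN
      2    NaN
      Name: a, dtype: float64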

      When I use a pandas UDF in Spark, it looks to me like there is information loss: I am no longer able to differentiate between null and NaN, which I could do with the older Python UDF:

       

      >>> from pyspark.sql import functions as F
      >>> @F.pandas_udf("double", F.PandasUDFType.GROUPED_AGG)
      ... def pd_sum(x):
      ...     return x.sum()
      >>> sdf = spark.createDataFrame(
      ...     [ [1.0, 2.0],
      ...       [None, 3.0],
      ...       [float('nan'), 4.0]
      ...     ], ['a', 'b'])
      >>> sdf.agg(pd_sum(sdf['a'])).collect()   # pandas' sum() skips NaN (and the null, now a NaN too)
      [Row(pd_sum(a)=1.0)]
      >>> sdf.select(F.sum(sdf['a'])).collect() # Spark's sum ignores null but propagates NaN
      [Row(sum(a)=nan)]
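
      For contrast, a plain Python UDF still lets me tell the two apart, since null arrives as None and NaN arrives as an actual float nan (a sketch; the classify helper is mine, and the output is what I would expect):

      >>> import math
      >>> @F.udf("string")
      ... def classify(x):
      ...     if x is None:
      ...         return "null"
      ...     return "nan" if math.isnan(x) else "value"
      >>> sdf.select(classify(sdf['a'])).collect()
      [Row(classify(a)='value'), Row(classify(a)='null'), Row(classify(a)='nan')]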
      

       

      If I use an integer column with NULL values, the pandas UDF actually receives a float type:

       

      >>> sdf = spark.createDataFrame([ [1, 2.0], [None, 3.0] ], ['a', 'b'])
      >>> sdf.dtypes
      [('a', 'bigint'), ('b', 'double')]
      >>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
      ... def pd_sum(x):
      ...     print(x, x.dtype)   # the bigint column arrives as float64
      ...     return x.sum()
      >>> sdf.agg(pd_sum(sdf['a'])).collect()
      0    1.0
      1    NaN
      Name: _0, dtype: float64 float64
      [Row(pd_sum(a)=1)]
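
      The only workaround I see (a sketch; the helper name and the extra mask column are mine) is to pass an explicit null mask computed on the Spark side, so the UDF can still tell which NaNs were originally nulls:

      >>> @F.pandas_udf("long", F.PandasUDFType.GROUPED_AGG)
      ... def count_true_nans(x, was_null):
      ...     # was_null is computed before the Arrow conversion,
      ...     # so it survives the cast to float64 intact
      ...     return int((x.isna() & ~was_null).sum())
      >>> sdf.agg(count_true_nans(sdf['a'], F.isnull(sdf['a']))).collect()
      [Row(count_true_nans(a, (a IS NULL))=0)]

      The NaN the UDF sees in 'a' was really a null, and only the mask can prove that. But it means every UDF needs an extra column per nullable input.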
      

       

      I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage this?

      Because from what I understand, if someone wants to use pandas UDFs as of now, they need to make some assumptions, like:

      • The full range of bigint cannot be used, because the column gets converted to float64 when null values are present (see the quick check below)
      • A float column must contain either NaN or NULL, never both, since the distinction is lost inside the UDF
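
      The first assumption comes down to float64 having only a 53-bit mantissa, which a plain-Python check shows:

      >>> float(2**53) == 2**53
      True
      >>> float(2**53 + 1) == 2**53 + 1   # 9007199254740993 has no exact float64 representation
      False

      So any bigint beyond 2**53 can silently change value on its way into the pandas UDF whenever the column contains a null.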

       

       

People

    • Assignee: Unassigned
    • Reporter: Abdeali Kothari
