Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version: 2.4.4
Fix Version: None
Description
The new Pandas UDF has been very helpful in simplifying how UDFs are written, and it is more performant. But I cannot reliably use it because its NULL-value handling differs from normal Spark.
Here is my understanding ...
In Spark, null/missing values are represented as follows:
- Float: null/NaN
- Integer: null
- String: null
- DateTime: null
In pandas, null/missing values are represented as follows (a short sketch after this list reproduces them):
- Float: NaN
- Integer: <not possible>
- String: null
- DateTime: NaT
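Here is a minimal pandas sketch (my own illustration, not part of the original report) reproducing the representations listed above:

import pandas as pd

print(pd.Series([1.0, None]))                         # float64: missing value shows as NaN
print(pd.Series([1, None]))                           # upcast to float64 -- a plain int64 Series cannot hold a missing value
print(pd.Series(["a", None]))                         # object: missing value stays None
print(pd.Series([pd.Timestamp("2020-01-01"), None]))  # datetime64[ns]: missing value shows as NaT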
When I use Spark with a Pandas UDF, there appears to be information loss: I can no longer differentiate between null and NaN, which I could do with the older Python UDF.
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     return x.sum()
>>> sdf = spark.createDataFrame(
...     [[1.0, 2.0], [None, 3.0], [float('nan'), 4.0]], ['a', 'b'])
>>> sdf.agg(pd_sum(sdf['a'])).collect()
[Row(pd_sum(a)=1.0)]
>>> sdf.select(F.sum(sdf['a'])).collect()
[Row(sum(a)=nan)]
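For contrast, here is a small sketch (assuming the same sdf as above; not part of the original report) showing that Spark itself still distinguishes the two values before they reach the UDF:

from pyspark.sql import functions as F

sdf.select(
    sdf['a'],
    sdf['a'].isNull().alias('is_null'),   # True only for the None row
    F.isnan(sdf['a']).alias('is_nan'),    # True only for the float('nan') row
).show()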
If I use an integer column with NULL values, the Pandas UDF actually receives a float type:
>>> sdf = spark.createDataFrame([[1, 2.0], [None, 3.0]], ['a', 'b'])
>>> sdf.dtypes
[('a', 'bigint'), ('b', 'double')]
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     print(x)
...     return x.sum()
>>> sdf.agg(pd_sum(sdf['a'])).collect()
0    1.0
1    NaN
Name: _0, dtype: float64
[Row(pd_sum(a)=1)]
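One possible workaround sketch (my own suggestion, assuming pandas >= 1.0 so the float-to-Int64 cast handles NaN; the report does not propose this): cast the received float64 series to pandas' nullable Int64 dtype inside the UDF, so missing values become <NA> instead of every value silently turning into a float:

from pyspark.sql import functions as F

@F.pandas_udf("long", F.PandasUDFType.GROUPED_AGG)
def pd_sum_nullable(x):
    # x arrives as float64, with NaN standing in for the original nulls
    as_int = x.astype("Int64")  # NaN -> <NA>, values become integers again
    return int(as_int.sum())    # sum() skips <NA>, just as it skips NaN

Note this only restores integer semantics inside the UDF; any value above 2**53 has already lost precision in the float64 round-trip, which is the first assumption below.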
I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage this?
Because from what I understand, anyone who wants to use pandas DataFrames as of now needs to make some assumptions, like the following (see the sketch after this list):
- The entire range of bigint will not work, because the column gets converted to float when null values are present
- A float column should contain either NaN or NULL, not both
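To make the first assumption concrete, here is a quick sketch (hypothetical value, not from the report) of why the bigint range breaks once the column becomes float64:

big = 2**53 + 1              # 9007199254740993, a valid Spark bigint
print(float(big) == big)     # False: float64 has only 53 bits of mantissa
print(int(float(big)))       # 9007199254740992 -- off by one after the round-trip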