Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version: 2.4.4
Fix Version: None
Description
The new Pandas UDF has been very helpful in simplifying how UDFs are written, and it is more performant. But I cannot reliably use it because its NULL-value handling differs from normal Spark.
Here is my understanding ...
In Spark, null/missing values are represented as follows:
- Float: null/NaN
- Integer: null
- String: null
- DateTime: null
In pandas, null/missing values are represented as follows (a short sketch after this list reproduces them):
- Float: NaN
- Integer: <not possible>
- String: null
- DateTime: NaT
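Here is a minimal pandas sketch (my own illustration, not part of the original report) reproducing the representations listed above:

import pandas as pd

print(pd.Series([1.0, None]))                         # float64: missing value shows as NaN
print(pd.Series([1, None]))                           # upcast to float64 -- a plain int64 Series cannot hold a missing value
print(pd.Series(["a", None]))                         # object: missing value stays None
print(pd.Series([pd.Timestamp("2020-01-01"), None]))  # datetime64[ns]: missing value shows as NaT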
When I use Spark with a Pandas UDF, there appears to be information loss: I can no longer differentiate between null and NaN, which I could do with the older Python UDF.
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     return x.sum()
>>> sdf = spark.createDataFrame(
...     [[1.0, 2.0], [None, 3.0], [float('nan'), 4.0]], ['a', 'b'])
>>> sdf.agg(pd_sum(sdf['a'])).collect()
[Row(pd_sum(a)=1.0)]
>>> sdf.select(F.sum(sdf['a'])).collect()
[Row(sum(a)=nan)]
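For contrast, here is a small sketch (assuming the same sdf as above; not part of the original report) showing that Spark itself still distinguishes the two values before they reach the UDF:

from pyspark.sql import functions as F

sdf.select(
    sdf['a'],
    sdf['a'].isNull().alias('is_null'),   # True only for the None row
    F.isnan(sdf['a']).alias('is_nan'),    # True only for the float('nan') row
).show()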
If I use an integer column with NULL values, the Pandas UDF actually receives a float type:
>>> sdf = spark.createDataFrame([[1, 2.0], [None, 3.0]], ['a', 'b'])
>>> sdf.dtypes
[('a', 'bigint'), ('b', 'double')]
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     print(x)
...     return x.sum()
>>> sdf.agg(pd_sum(sdf['a'])).collect()
0    1.0
1    NaN
Name: _0, dtype: float64
[Row(pd_sum(a)=1)]
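One possible workaround sketch (my own suggestion, assuming pandas >= 1.0 so the float-to-Int64 cast handles NaN; the report does not propose this): cast the received float64 series to pandas' nullable Int64 dtype inside the UDF, so missing values become <NA> instead of every value silently turning into a float:

from pyspark.sql import functions as F

@F.pandas_udf("long", F.PandasUDFType.GROUPED_AGG)
def pd_sum_nullable(x):
    # x arrives as float64, with NaN standing in for the original nulls
    as_int = x.astype("Int64")  # NaN -> <NA>, values become integers again
    return int(as_int.sum())    # sum() skips <NA>, just as it skips NaN

Note this only restores integer semantics inside the UDF; any value above 2**53 has already lost precision in the float64 round-trip, which is the first assumption below.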
I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage this?
Because from what I understand, anyone who wants to use pandas DataFrames as of now needs to make some assumptions, like the following (see the sketch after this list):
- The entire range of bigint will not work, because the column gets converted to float when null values are present
- A float column should contain either NaN or NULL, not both
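To make the first assumption concrete, here is a quick sketch (hypothetical value, not from the report) of why the bigint range breaks once the column becomes float64:

big = 2**53 + 1              # 9007199254740993, a valid Spark bigint
print(float(big) == big)     # False: float64 has only 53 bits of mantissa
print(int(float(big)))       # 9007199254740992 -- off by one after the round-trip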