Description
A Python UDF returns inconsistent results when two UDF columns are evaluated together versus one at a time.
The issue appeared after we upgraded to Spark 3, so it does not seem to exist in Spark 2.
How to reproduce it?
from pyspark.sql.functions import udf
from pyspark.sql.types import *

df = spark.createDataFrame([
    ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
    ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
    ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
], ['c1', 'c2'])

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        # list_elm is a list of (val, time) structs
        y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
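A quick diagnostic, not part of the original report and hedged as an assumption about the plan shape: in Spark 3, adjacent Python UDFs are typically collapsed into a single BatchEvalPython node, so inspecting the plan should show both UDFs evaluated in the same batch, which is the path the linked SPARK-35108 identifies as corrupting pickled rows.

# Assumption: both UDFs land in one BatchEvalPython node on branch-3.1,
# meaning their results are serialized back in the same batch.
df.select("c3", "c4").explain()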
Results:
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
|2.0|null|
|3.1|   3|
+---+----+
The test was done on branch-3.1 in local mode.
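A possible workaround, not from the original report: express the same "last element by time" logic with built-in higher-order functions (available in Spark 3.1) so no Python UDF, and hence no row pickling, is involved. The field names _1 and _2 below are the defaults Spark assigns to the tuple structs in the repro; adjust them if your schema differs.

from pyspark.sql import functions as F

def last_element_by_time(col_name):
    # Put the time field first so the struct's natural ordering sorts by time,
    # then take the value field of the last (latest) element.
    by_time = F.transform(
        F.col(col_name),
        lambda s: F.struct(s["_2"].alias("t"), s["_1"].alias("v")),
    )
    return F.element_at(F.sort_array(by_time), -1)["v"]

df = df.withColumn("c3_native", last_element_by_time("c1"))
df = df.withColumn("c4_native", last_element_by_time("c2"))
df.select("c3_native", "c4_native").show()

Since the whole expression stays in the JVM, selecting both derived columns together should not trigger the inconsistency.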
Issue Links
- is duplicated by SPARK-35108: Pickle produces incorrect key labels for GenericRowWithSchema (data corruption) (Resolved)