Description
A Python UDF returns inconsistent results when two UDF columns are evaluated together versus one at a time.
The issue appeared after we upgraded to Spark 3, so it does not seem to exist in Spark 2.
How to reproduce it?
from pyspark.sql.functions import udf
from pyspark.sql.types import *

df = spark.createDataFrame([
    ([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), (1, "2"), (1, "3")]),
    ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, "2"), (2, "3")]),
    ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), (3, "3")]),
], ['c1', 'c2'])

def getLastElementWithTimeMaster(data_type):
    def getLastElementWithTime(list_elm):
        # list_elm is a list of (val, time) structs
        y = sorted(list_elm, key=lambda x: x[1])  # default is ascending
        return y[-1][0]
    return udf(getLastElementWithTime, data_type)

# Add 2 columns which apply the Python UDF
df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))

# Show the results
df.select("c3").show()
df.select("c4").show()
df.select("c3", "c4").show()
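A quick diagnostic, not part of the original report and hedged as an assumption about the plan shape: in Spark 3, adjacent Python UDFs are typically collapsed into a single BatchEvalPython node, so inspecting the plan should show both UDFs evaluated in the same batch, which is the path the linked SPARK-35108 identifies as corrupting pickled rows.

# Assumption: both UDFs land in one BatchEvalPython node on branch-3.1,
# meaning their results are serialized back in the same batch.
df.select("c3", "c4").explain()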
Results:
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
|2.0|
|3.1|
+---+

>>> df.select("c4").show()
+---+
| c4|
+---+
|  1|
|  2|
|  3|
+---+

>>> df.select("c3", "c4").show()
+---+----+
| c3|  c4|
+---+----+
|1.0|null|
|2.0|null|
|3.1|   3|
+---+----+
The test was done on branch-3.1 in local mode.
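A possible workaround, not from the original report: express the same "last element by time" logic with built-in higher-order functions (available in Spark 3.1) so no Python UDF, and hence no row pickling, is involved. The field names _1 and _2 below are the defaults Spark assigns to the tuple structs in the repro; adjust them if your schema differs.

from pyspark.sql import functions as F

def last_element_by_time(col_name):
    # Put the time field first so the struct's natural ordering sorts by time,
    # then take the value field of the last (latest) element.
    by_time = F.transform(
        F.col(col_name),
        lambda s: F.struct(s["_2"].alias("t"), s["_1"].alias("v")),
    )
    return F.element_at(F.sort_array(by_time), -1)["v"]

df = df.withColumn("c3_native", last_element_by_time("c1"))
df = df.withColumn("c4_native", last_element_by_time("c2"))
df.select("c3_native", "c4_native").show()

Since the whole expression stays in the JVM, selecting both derived columns together should not trigger the inconsistency.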
Issue Links
- is duplicated by SPARK-35108: Pickle produces incorrect key labels for GenericRowWithSchema (data corruption) (Resolved)