[SPARK-39962] Global aggregation against pandas aggregate UDF does not take the column order into account - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.3, 3.3.0, 3.2.2, 3.4.0
Fix Version/s: 3.1.4, 3.3.1, 3.2.3, 3.4.0
Component/s: PySpark
Labels: None

Description

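import pandas as pd
from pyspark.sql import functions as f

# Assumes an active SparkSession bound to `spark`, as in the pyspark shell.

# Series-to-scalar (grouped aggregate) pandas UDF that returns the column mean.
@f.pandas_udf("double")
def AVG(x: pd.Series) -> float:
    return x.mean()


abc = spark.createDataFrame([(1.0, 5.0, 17.0)], schema=["a", "b", "c"])
# Global aggregation on the original column order.
abc.agg(AVG("a"), AVG("c")).show()
# Same aggregation after reordering the columns with select().
abc.select("c", "a").agg(AVG("a"), AVG("c")).show()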

+------+------+
|AVG(a)|AVG(c)|
+------+------+
|   1.0|  17.0|
+------+------+

+------+------+
|AVG(a)|AVG(c)|
+------+------+
|  17.0|   1.0|
+------+------+

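Both calls should return the same result (AVG(a) = 1.0, AVG(c) = 17.0). Instead, after the columns are reordered with select("c", "a"), the two values are swapped.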
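
The expected behavior can be checked without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: `project`, `agg_by_name`, and `agg_by_stale_offset` are hypothetical helpers invented for illustration. It shows that resolving each aggregate's input by column name gives the same result for either column order, and that a toy positional lookup, which assumes the inputs arrive in the order the aggregates list them, happens to reproduce the swapped values shown above. The actual cause inside Spark is not described in this report.

```python
from statistics import mean

# Toy data mirroring the report: one row with columns a, b, c.
columns = ["a", "b", "c"]
rows = [(1.0, 5.0, 17.0)]


def project(cols, data, wanted):
    """Return the selected column names and rows reordered to match (like select())."""
    idx = [cols.index(name) for name in wanted]
    return list(wanted), [tuple(row[i] for i in idx) for row in data]


def agg_by_name(cols, data, targets):
    """Expected behavior: resolve each target against the current column order."""
    return {t: mean(row[cols.index(t)] for row in data) for t in targets}


def agg_by_stale_offset(data, offsets):
    """Hypothetical faulty variant: apply offsets computed for a different column order."""
    return {t: mean(row[i] for row in data) for t, i in offsets.items()}


sel_cols, sel_rows = project(columns, rows, ["c", "a"])

# Name-based lookup is independent of column order.
expected = agg_by_name(columns, rows, ["a", "c"])
reordered = agg_by_name(sel_cols, sel_rows, ["a", "c"])

# Assumed offsets: "a" first, "c" second, ignoring that select() put "c" first.
swapped = agg_by_stale_offset(sel_rows, {"a": 0, "c": 1})

print(expected)
print(reordered)
print(swapped)
```

In this sketch `expected` and `reordered` are equal, while `swapped` pairs each name with the other column's value, which matches the second table in the report.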

Issue Links

links to

[Github] Pull Request #37390 (HyukjinKwon)

[Github] Pull Request #37401 (HyukjinKwon)

People

Assignee: Hyukjin Kwon

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 1

Dates

Created: 03/Aug/22 03:08

Updated: 12/Dec/22 18:10

Resolved: 03/Aug/22 07:13