Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.5.1
Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2
Description
A Pandas API on Spark (pyspark.pandas) DataFrame groupby with as_index=False and a relabeling (named) aggregation, such as
from pyspark import pandas as ps
ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
fails to include the group keys in the resulting DataFrame. This diverges from the expected behavior and from the behavior of native Pandas, which keeps the key column "a" in the result:
actual
   b_max
0      1
expected
   a  b_max
0  0      1
A possible fix is to prepend groupby key columns to order and columns before filtering here: https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328
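The idea can be sketched in plain Python, without Spark. This is only a model of the relabeling step under stated assumptions, not the code in groupby.py: the function select_and_relabel, its keys parameter, and the dict standing in for one aggregated row are all hypothetical, and the internal column labels used by pyspark.pandas may differ.

```python
from typing import Any, Dict, Hashable, List, Sequence


def select_and_relabel(
    row: Dict[Hashable, Any],
    order: Sequence[Hashable],
    columns: Sequence[str],
    keys: Sequence[str] = (),
) -> Dict[str, Any]:
    """Hypothetical model of the relabeling step.

    Selects the labels in `order` from an aggregated row and renames them
    to `columns`. When `keys` is non-empty, the group-key labels are
    prepended to both lists first, so they survive the selection.
    """
    full_order: List[Hashable] = list(keys) + list(order)
    full_columns: List[str] = list(keys) + list(columns)
    return {new: row[old] for old, new in zip(full_order, full_columns)}


# One aggregated row for group a == 0 (assumed shape, for illustration only).
aggregated = {"a": 0, ("b", "max"): 1}

# Behavior reported in this issue: the key column is filtered out.
current = select_and_relabel(aggregated, [("b", "max")], ["b_max"])

# Proposed behavior: keys are prepended before filtering.
fixed = select_and_relabel(aggregated, [("b", "max")], ["b_max"], keys=["a"])

print(current)  # {'b_max': 1}
print(fixed)    # {'a': 0, 'b_max': 1}
```

In the real fix the prepending would presumably apply only when as_index=False, since with as_index=True the keys already live in the index.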