[SPARK-48045] Pandas API groupby with multi-agg-relabel ignores as_index=False - ASF JIRA

A pandas API on Spark DataFrame groupby that combines as_index=False with named aggregation (multi-aggregation relabeling), such as

from pyspark import pandas as ps
ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))

omits the group keys from the resulting DataFrame. This diverges from the expected behavior and from the behavior of native pandas, e.g.

actual

   b_max
0      1

expected

   a  b_max
0  0      1
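
For reference, the expected output can be reproduced with native pandas. This is a minimal sketch that assumes pandas is installed; the variable names pdf and expected are illustrative and do not come from the report.

```python
import pandas as pd

# Same data and call as in the report, but using native pandas.
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})
expected = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# With as_index=False the group key "a" stays a regular column.
print(list(expected.columns))  # ['a', 'b_max']
print(expected)
```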

A possible fix is to prepend the groupby key columns to the order and columns variables before the relabeling filter is applied here: https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328
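
The idea behind the proposed fix can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not the Spark code: the aggregated frame is modeled as a dict from column label to values, the names order and columns are borrowed from the description above, and the helpers relabel and relabel_keeping_keys are hypothetical.

```python
def relabel(frame, order, columns):
    """Keep only the labels listed in `order`, renaming them to `columns`."""
    return {new: frame[old] for old, new in zip(order, columns)}


def relabel_keeping_keys(frame, keys, order, columns):
    """Hypothetical variant: prepend the group-key labels before filtering."""
    return relabel(frame, list(keys) + list(order), list(keys) + list(columns))


# Toy aggregated frame: group key "a" plus the aggregated column ("b", "max").
aggregated = {"a": [0], ("b", "max"): [1]}
order = [("b", "max")]   # labels selected by the relabeling step
columns = ["b_max"]      # user-facing names for those labels

without_keys = relabel(aggregated, order, columns)
with_keys = relabel_keeping_keys(aggregated, ["a"], order, columns)

print(without_keys)  # {'b_max': [1]}            (group key dropped)
print(with_keys)     # {'a': [0], 'b_max': [1]}  (group key kept)
```

In this model the group key disappears only because it is absent from order when the filter runs, so adding it to both lists is enough to keep it. Whether the actual change should also be conditional on as_index=False is left to the linked pull request.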

links to

GitHub Pull Request #46391