  1. Spark
  2. SPARK-48045

Pandas API groupby with multi-agg-relabel ignores as_index=False

Details

    Description

      A Pandas API on Spark DataFrame groupby with as_index=False and aggregation relabeling (named aggregation), such as

      from pyspark import pandas as ps
      ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))

      omits the group key columns from the resulting DataFrame. This diverges from the expected behavior and from the behavior of native pandas:

      actual

         b_max
      0      1 

      expected

         a  b_max
      0  0      1 
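
      For comparison, the same call can be run against native pandas, which is the behavior the expected output above is based on. This is only a minimal sketch and assumes pandas is installed; it does not exercise pyspark at all.

```python
import pandas as pd

# Same data and the same named aggregation as in the report, but with native pandas.
result = (
    pd.DataFrame({"a": [0, 0], "b": [0, 1]})
    .groupby("a", as_index=False)
    .agg(b_max=("b", "max"))
)

# With as_index=False the group key "a" is kept as a regular column.
print(list(result.columns))  # ['a', 'b_max']
```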

       

      A possible fix is to prepend the groupby key columns to order and columns before the filtering step here: https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328
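
      The idea behind the suggested fix can be sketched in plain Python. This is an illustration only, not the Spark implementation: the helper names prepend_group_keys and relabel are hypothetical, the intermediate result is modeled as a dict, and the assumption that the group key is still present in the intermediate result under its own name is mine.

```python
def relabel(frame, order, columns):
    """Keep only the labels listed in `order` and rename them to `columns`."""
    return {new: frame[old] for old, new in zip(order, columns)}


def prepend_group_keys(order, columns, group_keys, as_index):
    """Hypothetical helper: put the group keys first when as_index=False."""
    if as_index:
        return list(order), list(columns)
    # Skip keys that are already part of the requested output columns.
    keys = [k for k in group_keys if k not in columns]
    return keys + list(order), keys + list(columns)


# Assumed shape of the aggregated result before relabeling.
intermediate = {"a": [0], ("b", "max"): [1]}
order, columns = [("b", "max")], ["b_max"]

# Relabeling alone drops the group key, as in the "actual" output.
buggy = relabel(intermediate, order, columns)

# Prepending the keys first keeps it, as in the "expected" output.
fixed = relabel(
    intermediate, *prepend_group_keys(order, columns, ["a"], as_index=False)
)
print(buggy)
print(fixed)
```

      Doing this only when as_index=False leaves the as_index=True path unchanged, where the keys belong to the index rather than the columns.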

       

            People

              sinaiamonkar-sai Saidatt Sinai Amonkar
              pgeez Paul George
