Spark / SPARK-21858

Make Spark grouping_id() compatible with Hive grouping__id

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      If you migrate ETL jobs that use `grouping__id` in Hive to Spark and replace it with Spark's `grouping_id()`, you will find that the two are evaluated differently.

      Here is an example.

      select A, B, grouping__id/grouping_id() from t group by A, B grouping sets((), (A), (B), (A,B))
      

      Running it on Hive and Spark separately gives the results below. The first column lists the attributes selected in each grouping set, and each binary column shows its bits in the attribute order given in the header.

      Grouping set   Spark binary (A B)   Spark   Hive   Hive binary (B A)
      ()             11                   3       0      00
      (B)            10                   2       2      10
      (A)            01                   1       1      01
      (A, B)         00                   0       3      11

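      As shown above, in Hive an attribute that is not selected in the grouping set is set to 0 and a selected attribute is set to 1; in Spark it is the opposite.
      Moreover, Hive first reverses the order of the attributes in `group by` before building the bits, whereas Spark evaluates them in the order written.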
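The two rules above can be sketched in plain Python to make the bit layout concrete. This is only an illustration of the behavior as described in this report, not code from either engine; the function names `spark_grouping_id` and `hive_grouping_id_as_reported` are hypothetical.

```python
def spark_grouping_id(group_by, grouping_set):
    """Spark rule as described above: a bit is 1 when the column is NOT
    selected in the grouping set; columns are taken in the order written,
    with the first column as the most significant bit."""
    gid = 0
    for col in group_by:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


def hive_grouping_id_as_reported(group_by, grouping_set):
    """Hive rule as described in this report: a bit is 1 when the column IS
    selected in the grouping set; the column order is reversed first."""
    gid = 0
    for col in reversed(group_by):
        gid = (gid << 1) | (1 if col in grouping_set else 0)
    return gid


if __name__ == "__main__":
    group_by = ["A", "B"]
    for grouping_set in [(), ("A",), ("B",), ("A", "B")]:
        print(
            grouping_set,
            spark_grouping_id(group_by, grouping_set),
            hive_grouping_id_as_reported(group_by, grouping_set),
        )
```

For `group by A, B`, the sketch yields the same Spark and Hive values as the table in this description.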

      I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`.

              People

              • Assignee:
                Unassigned
                Reporter:
                Yann Byron
              • Votes:
                0
                Watchers:
                6

                Dates

                • Created:
                  Updated:
                  Resolved: