  1. Spark
  2. SPARK-21858

Make Spark grouping_id() compatible with Hive grouping__id


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      If you migrate ETL jobs that use Hive's `grouping__id` to Spark and replace it with Spark's `grouping_id()`, you will find that the two evaluate differently.

      Here is an example.

      select A, B, grouping__id/grouping_id() from t group by A, B grouping sets((), (A), (B), (A,B))
      

      Running it on Hive and Spark separately gives the results below. In the A and B columns, "yes" means the attribute is selected in the grouping set and "no" means it is not.

      A    B    Binary Expression in Spark  Spark  Hive  Binary Expression in Hive  B    A
      no   no   11                          3      0     00                         no   no
      no   yes  10                          2      2     10                         yes  no
      yes  no   01                          1      1     01                         no   yes
      yes  yes  00                          0      3     11                         yes  yes

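      As shown above, Hive sets the bit of an unselected attribute to 0 and the bit of a selected attribute to 1, while Spark does the opposite.
      Moreover, Hive first reverses the order of the attributes in `group by` before building the bit vector, whereas Spark evaluates them in the order written.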
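The two encodings described above can be modelled in a few lines of Python. This is only a sketch of the behaviour reported in this issue, not code from either engine; the function names `spark_grouping_id` and `hive_grouping_id` are hypothetical.

```python
def spark_grouping_id(group_by, grouping_set):
    """Model of Spark's grouping_id() as described in this issue.

    A bit is 1 when the column is NOT in the grouping set.
    Columns are taken in the order written, first column = highest bit.
    """
    gid = 0
    for col in group_by:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


def hive_grouping_id(group_by, grouping_set):
    """Model of Hive's grouping__id as described in this issue.

    A bit is 1 when the column IS in the grouping set.
    Columns are reversed first, so the first column = lowest bit.
    """
    gid = 0
    for col in reversed(group_by):
        gid = (gid << 1) | (1 if col in grouping_set else 0)
    return gid


if __name__ == "__main__":
    group_by = ["A", "B"]
    for grouping_set in [(), ("A",), ("B",), ("A", "B")]:
        print(grouping_set,
              spark_grouping_id(group_by, grouping_set),
              hive_grouping_id(group_by, grouping_set))
```

For `group by A, B`, the grouping sets `()`, `(A)`, `(B)`, `(A,B)` give 3, 1, 2, 0 under the Spark model and 0, 1, 2, 3 under the Hive model, matching the values reported in the table.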

      I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`.
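Short of changing Spark itself, a migrated job could translate one value into the other. The sketch below assumes the two differences described in this issue (inverted bits and reversed column order) are the only ones; the helper name `spark_to_hive_grouping_id` is hypothetical, and the caller must supply the number of `group by` columns.

```python
def spark_to_hive_grouping_id(spark_id, num_columns):
    """Convert a Spark-style grouping id into the Hive-style value
    described in this issue.

    Step 1: invert the low num_columns bits (Spark marks unselected
            columns with 1, Hive marks selected columns with 1).
    Step 2: reverse the bit order (Hive reverses the group by columns).
    """
    mask = (1 << num_columns) - 1
    inverted = ~spark_id & mask

    hive_id = 0
    for _ in range(num_columns):
        # Move the lowest remaining bit of `inverted` to the low end of hive_id.
        hive_id = (hive_id << 1) | (inverted & 1)
        inverted >>= 1
    return hive_id


if __name__ == "__main__":
    # Two group by columns, as in the example query.
    for spark_id in (3, 2, 1, 0):
        print(spark_id, spark_to_hive_grouping_id(spark_id, 2))
```

With two columns this maps the Spark values 3, 2, 1, 0 to 0, 2, 1, 3, which agrees with the Hive column of the table. The same inversion and reversal could in principle be written as a SQL expression over `grouping_id()`, but that is left out here because it depends on the column count of each query.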


            People

              Assignee: Unassigned
              Reporter: Yann Byron
              Votes: 0
              Watchers: 7
