Details
- Bug
- Status: Closed
- Major
- Resolution: Not A Problem
- 2.2.0
- None
- None
Description
If you migrate ETL jobs that use `grouping__id` in Hive to Spark and replace it with Spark's `grouping_id()`, you will find that the two are evaluated differently.
Here is an example.
`select A, B, grouping__id from t group by A, B grouping sets((), (A), (B), (A,B))` (on Spark, use `grouping_id()` in place of `grouping__id`)
Running it on Hive and Spark separately gives the results below. Each bit of the binary expression corresponds to one `group by` attribute and records whether that attribute is selected in the grouping set.
Binary Expression in Spark (bit order A B) | Spark | Hive | Binary Expression in Hive (bit order B A) |
---|---|---|---|
11 | 3 | 0 | 00 |
10 | 2 | 2 | 10 |
01 | 1 | 1 | 01 |
00 | 0 | 3 | 11 |
As shown above, Hive sets the bit of a selected attribute to 1 and the bit of an unselected attribute to 0, and Spark does the opposite.
Moreover, Hive first reverses the order of the attributes in `group by` before computing the ID (bit order B A), whereas Spark evaluates them in the order written (bit order A B).
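To make the two rules concrete, here is a minimal Python sketch that reproduces the numbers in the table. The function names `spark_grouping_id` and `hive_grouping_id` are hypothetical and exist only for illustration; they model the behavior described in this issue and are not taken from either code base.

```python
from typing import Sequence, Set


def spark_grouping_id(group_by: Sequence[str], selected: Set[str]) -> int:
    """Model of Spark's rule: a bit is 1 when the attribute is NOT selected,
    and the first group-by attribute is the most significant bit."""
    gid = 0
    for col in group_by:
        gid = (gid << 1) | (0 if col in selected else 1)
    return gid


def hive_grouping_id(group_by: Sequence[str], selected: Set[str]) -> int:
    """Model of the Hive rule described in this issue: a bit is 1 when the
    attribute IS selected, and the attributes are reversed first, so the
    first group-by attribute ends up as the least significant bit."""
    gid = 0
    for col in reversed(group_by):
        gid = (gid << 1) | (1 if col in selected else 0)
    return gid


group_by = ["A", "B"]
for grouping_set in [set(), {"B"}, {"A"}, {"A", "B"}]:
    print(
        sorted(grouping_set),
        spark_grouping_id(group_by, grouping_set),
        hive_grouping_id(group_by, grouping_set),
    )
```

With two attributes the middle rows happen to agree (1 and 2), which can hide the difference in small tests; with three or more attributes the values diverge on most grouping sets.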
I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`.
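Until the two agree, a migrated job could translate the value itself. The following is only a sketch under the assumptions stated in this issue (Hive inverts each bit and reverses the bit order relative to Spark); `spark_to_hive_grouping_id` is a hypothetical helper, not an existing Spark or Hive function, and `n_cols` is the number of attributes in `group by`.

```python
def spark_to_hive_grouping_id(spark_id: int, n_cols: int) -> int:
    """Map a Spark grouping_id() value to the Hive grouping__id value
    described in this issue, for a GROUP BY with n_cols attributes."""
    hive_id = 0
    for i in range(n_cols):
        bit = (spark_id >> i) & 1              # read Spark bits from least significant upward
        hive_id = (hive_id << 1) | (bit ^ 1)   # invert the bit and append it, which reverses the order
    return hive_id


# Reproduce the Spark -> Hive pairs from the two-attribute example.
for spark_id in (3, 2, 1, 0):
    print(spark_id, spark_to_hive_grouping_id(spark_id, 2))
```

The same arithmetic can be expressed in SQL with bitwise operators, but the exact expression depends on the number of grouping attributes, so it has to be written per query.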
Issue Links
- relates to
  - HIVE-12833 GROUPING__ID is wrong (Open)
  - SPARK-21055 Support grouping__id (Resolved)