SPARK-21858

Make Spark grouping_id() compatible with Hive grouping__id

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      If you migrate ETL jobs that use `grouping__id` from Hive to Spark and replace Hive's `grouping__id` with Spark's `grouping_id()`, you will find that the two evaluate differently.

      Here is an example.

      select A, B, grouping__id/grouping_id() from t group by A, B grouping sets((), (A), (B), (A,B))
      

      Running it on Hive and Spark separately, you'll find that the results differ (an attribute that is selected in the grouping set is marked ✔ below, an unselected one ✘):

      A  B  Binary expression in Spark (A B)  Spark  Hive  Binary expression in Hive (B A)
      ✘  ✘  11                                3      0     00
      ✘  ✔  10                                2      2     10
      ✔  ✘  01                                1      1     01
      ✔  ✔  00                                0      3     11

      As shown above, Hive sets the bit for a selected attribute to 1 and for an unselected one to 0; in Spark it's the opposite.
      Moreover, Hive reverses the order of the `group by` attributes before computing the id, while Spark evaluates them in the order given.

      I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`.
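
      Until then, for the two-column case above, the Hive-style id can be reconstructed in plain Spark SQL: flip each `grouping()` bit (Hive uses 1 for a selected attribute) and reverse the bit order (Hive reads B A rather than A B). A minimal sketch, assuming the table `t(A, B)` from the example; `hive_compatible_id` is just an illustrative alias:

      select A, B,
             grouping_id() as spark_id,
             -- complement each grouping bit and swap their positions
             (1 - grouping(B)) * 2 + (1 - grouping(A)) as hive_compatible_id
      from t
      group by A, B grouping sets ((), (A), (B), (A, B))

      Against the table above, `hive_compatible_id` yields 0, 2, 1, 3 where `spark_id` yields 3, 2, 1, 0.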

        Issue Links

          Activity

          Dongjoon Hyun added a comment -

          Hi, Yann Byron.
          Thank you for investigating this and for the nice description.
          It seems there is a related issue, HIVE-12833, about this. I'm wondering what you think about it.

          Dongjoon Hyun added a comment -

          I'm adding SPARK-21055, too. IIUC, SPARK-21055 is implementing the syntax and this issue is suggesting the semantics.
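
          (For context: once that syntax work lands, a query like the one in the description could name `grouping__id` directly in Spark, e.g.

          select A, B, grouping__id from t group by A, B grouping sets ((), (A), (B), (A, B))

          and this issue is about which values that expression should then produce.)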

          Yann Byron added a comment -

          Dongjoon Hyun,
          thank you for your reply.

          SPARK-21055 just makes `grouping__id` usable in Spark, but its behavior still differs from Hive: the grouping ids generated by Spark and Hive are different.

          I'm not sure about HIVE-12833. Can we make Spark match Hive, even though Hive has the bug mentioned in HIVE-12833?

          Dongjoon Hyun added a comment -

          Given that the reporter of that Hive issue is a Spark committer, I guess this Spark feature was intentionally designed this way.

          Yann Byron added a comment -

          I know that.
          Because a large number of queries need to be migrated from Hive to Spark SQL, I'm afraid I have to keep the query results the same on both.
          So I'll modify the `grouping_id()` behavior in my local version.

          Thank you again.
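
          For a migration like this, a quick cross-check of the two engines is possible with a throwaway relation. A sketch in Spark 2.2 SQL (Hive would need a real table in place of the inline `values` relation; `t` here is hypothetical test data, not the table from the description):

          -- hypothetical one-row input, just to exercise the four grouping sets
          create temporary view t as select * from values (1, 2) as t(A, B);

          select A, B, grouping_id() as gid
          from t
          group by A, B grouping sets ((), (A), (B), (A, B))
          order by gid

          Comparing the `gid` column from each engine against the table in the description shows the mismatch directly.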

          Wenchen Fan added a comment -

          I think it's a Hive bug, and it may be fixed in future Hive versions. So Spark should stick with the current semantics, which follow the Hive documentation.

          Dongjoon Hyun added a comment -

          Thank you for the conclusion, Wenchen Fan!


            People

            • Assignee:
              Unassigned
            • Reporter:
              Yann Byron
            • Votes:
              0
            • Watchers:
              5
