SPARK-21858

Make Spark grouping_id() compatible with Hive grouping__id

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      If you migrate ETL jobs that use `grouping__id` from Hive to Spark and replace Hive's `grouping__id` with Spark's `grouping_id()`, you will find that the two evaluate differently.

      Here is an example.

      select A, B, grouping__id/grouping_id() from t group by A, B grouping sets((), (A), (B), (A,B))
      

      Running it on Hive and Spark separately, you'll find that the results differ (an attribute that is selected in the grouping set is marked ✔ below, an unselected one ✘):

      A  B  Binary expression in Spark (A B)  Spark  Hive  Binary expression in Hive (B A)
      ✘  ✘  11                                3      0     00
      ✘  ✔  10                                2      2     10
      ✔  ✘  01                                1      1     01
      ✔  ✔  00                                0      3     11

      As shown above, Hive sets the bit for a selected attribute to 1 and for an unselected one to 0; in Spark it's the opposite.
      Moreover, Hive reverses the order of the `group by` attributes before computing the id, while Spark evaluates them in the order given.

      I suggest modifying the behavior of `grouping_id()` to make it compatible with Hive's `grouping__id`.
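
      Until then, for the two-column case above, the Hive-style id can be reconstructed in plain Spark SQL: flip each `grouping()` bit (Hive uses 1 for a selected attribute) and reverse the bit order (Hive reads B A rather than A B). A minimal sketch, assuming the table `t(A, B)` from the example; `hive_compatible_id` is just an illustrative alias:

      select A, B,
             grouping_id() as spark_id,
             -- complement each grouping bit and swap their positions
             (1 - grouping(B)) * 2 + (1 - grouping(A)) as hive_compatible_id
      from t
      group by A, B grouping sets ((), (A), (B), (A, B))

      Against the table above, `hive_compatible_id` yields 0, 2, 1, 3 where `spark_id` yields 3, 2, 1, 0.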

        Issue Links

          Activity

          Dongjoon Hyun added a comment -

          Hi, Yann Byron.
          Thank you for investigating this and for the nice description.
          It seems there is a related issue, HIVE-12833, about this. I'm wondering what you think about it.

          Dongjoon Hyun added a comment -

          I'm adding SPARK-21055, too. IIUC, SPARK-21055 is implementing the syntax and this issue is suggesting the semantics.
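
          (For context: once that syntax work lands, a query like the one in the description could name `grouping__id` directly in Spark, e.g.

          select A, B, grouping__id from t group by A, B grouping sets ((), (A), (B), (A, B))

          and this issue is about which values that expression should then produce.)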

          Yann Byron added a comment -

          Dongjoon Hyun,
          thank you for your reply.

          SPARK-21055 just makes `grouping__id` usable in Spark, but its behavior still differs from Hive: the grouping ids generated by Spark and Hive are different.

          I'm not sure about HIVE-12833. Can we make Spark match Hive, even though Hive has the bug mentioned in HIVE-12833?

          Dongjoon Hyun added a comment -

          Given that the reporter of that Hive issue is a Spark committer, I guess this Spark feature was intentionally designed this way.

          Yann Byron added a comment -

          I know that.
          Because a large number of queries need to be migrated from Hive to Spark SQL, I'm afraid I have to keep the query results the same on both.
          So I'll modify the `grouping_id()` behavior in my local version.

          Thank you again.
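
          For a migration like this, a quick cross-check of the two engines is possible with a throwaway relation. A sketch in Spark 2.2 SQL (Hive would need a real table in place of the inline `values` relation; `t` here is hypothetical test data, not the table from the description):

          -- hypothetical one-row input, just to exercise the four grouping sets
          create temporary view t as select * from values (1, 2) as t(A, B);

          select A, B, grouping_id() as gid
          from t
          group by A, B grouping sets ((), (A), (B), (A, B))
          order by gid

          Comparing the `gid` column from each engine against the table in the description shows the mismatch directly.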

          Wenchen Fan added a comment -

          I think it's a Hive bug, and it may be fixed in future Hive versions. So Spark should stick with the current semantics, which follow the Hive documentation.

          Dongjoon Hyun added a comment -

          Thank you for the conclusion, Wenchen Fan!


            People

            • Assignee:
              Unassigned
            • Reporter:
              Yann Byron
            • Votes:
              0
            • Watchers:
              5
