Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1069

In Aggregate, deprecate indicators, and allow GROUPING to be used as an aggregate function

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: None
    • Labels:
      None

      Description

      Grouping sets are currently implemented in Calcite using a bit to indicate each
      of the grouping columns. For instance, consider the following group by clause:

      GROUP BY CUBE (a, b)

      The generated Aggregate operator in Calcite will have a row schema consisting of [a, b, GROUPING(a), GROUPING(b)], where GROUPING( x ) is a boolean field indicator which represents whether x is participating in the group by clause.

      In contrast, Hive's implementation stores a single number corresponding to the GROUPING bit vector associated with a row (this is the result of the GROUPING_ID function in RDBMS such as MSSQLServer, Oracle, etc). Thus, the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].

      This difference is creating a mismatch between Calcite and Hive. As of now, we work around this mismatch in the Hive side: we create our own GROUPING_ID function applied over those columns. However, we have some issues related to predicates pushdown, constant propagation, join project transpose rule (HIVE-12923)
      etc., that we need to continue solving as new rules are added to Hive optimizer. In short, this is making the code on the Hive side harder and harder to maintain.

      This jira is intended to modify the implementation on the Calcite side to that we need not make workarounds/hacks in Hive to support Grouping IDs.

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          I'm thinking of an alternative solution. Currently, as you know, an Aggregate with more than one grouping set returns more columns than one with only one grouping set. We have been arguing in HIVE-12923 about whether there should be 1 extra column (Hive's preference) or N extra columns (Calcite's preference).

          My new proposal is that there should be no extra columns. We make GROUPING into an aggregate function, and if you want those extra columns you can add calls to GROUPING.

          If the row type of Aggregate is same regardless of the number of grouping sets, it will simplify a bunch of things. For example, it would be easier to write a rule that pushes down the Filter "group_id = 2", because we wouldn't have to worry about disappearing columns, and whether they are used.

          Show
          julianhyde Julian Hyde added a comment - I'm thinking of an alternative solution. Currently, as you know, an Aggregate with more than one grouping set returns more columns than one with only one grouping set. We have been arguing in HIVE-12923 about whether there should be 1 extra column (Hive's preference) or N extra columns (Calcite's preference). My new proposal is that there should be no extra columns. We make GROUPING into an aggregate function, and if you want those extra columns you can add calls to GROUPING . If the row type of Aggregate is same regardless of the number of grouping sets, it will simplify a bunch of things. For example, it would be easier to write a rule that pushes down the Filter "group_id = 2", because we wouldn't have to worry about disappearing columns, and whether they are used.
          Hide
          julianhyde Julian Hyde added a comment -

          I have created a pull request. Please review. This could potentially break Hive, so I need a +1 from a developer involved with Hive.

          I have endeavored to make this backwards compatible, by still allowing Aggregate with indicator = true. But it is not well tested. I strongly suggest that people convert to indicator = false. There are many benefits, for example, rules that were written for non-grouping sets queries should work with grouping sets unchanged or with minor modifications. (See CALCITE-461 for the pain that has caused.)

          Show
          julianhyde Julian Hyde added a comment - I have created a pull request. Please review. This could potentially break Hive, so I need a +1 from a developer involved with Hive. I have endeavored to make this backwards compatible, by still allowing Aggregate with indicator = true. But it is not well tested. I strongly suggest that people convert to indicator = false. There are many benefits, for example, rules that were written for non-grouping sets queries should work with grouping sets unchanged or with minor modifications. (See CALCITE-461 for the pain that has caused.)
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Can you please hold off this till 1.13 is out, otherwise it may make Hive's adoption of 1.13 complicated.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Can you please hold off this till 1.13 is out, otherwise it may make Hive's adoption of 1.13 complicated.
          Hide
          julianhyde Julian Hyde added a comment -

          Fair enough, will do.

          However, can you please review (or even do a test integration whether this breaks anything) as early as you can. It's a big change, and I want to get this into master as soon (after 1.13) as I can.

          Show
          julianhyde Julian Hyde added a comment - Fair enough, will do. However, can you please review (or even do a test integration whether this breaks anything) as early as you can. It's a big change, and I want to get this into master as soon (after 1.13) as I can.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, I'll take a look at it + the integration with Hive.

          I will also start the discussion for the 1.13 release.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , I'll take a look at it + the integration with Hive. I will also start the discussion for the 1.13 release.
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/1e7ae1c3 .
          Hide
          michaelmior Michael Mior added a comment -

          Resolved in release 1.14.0 (2017-10-01)

          Show
          michaelmior Michael Mior added a comment - Resolved in release 1.14.0 (2017-10-01)

            People

            • Assignee:
              julianhyde Julian Hyde
              Reporter:
              hsubramaniyan Hari Sankar Sivarama Subramaniyan
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development