Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-512

Add GROUP_ID and GROUPING_ID functions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0-incubating
    • Component/s: None
    • Labels:
      None

      Description

      Implement GROUP_ID and GROUPING_ID(expr [, expr]...) functions, as they are defined in Oracle.

      We already have, as of CALCITE-370, the GROUPING(expr) function. Now define GROUPING_ID:

      GROUPING_ID(e2, e1, e0)
      == GROUPING(e2) * 4
      + GROUPING(e1) * 2
      + GROUPING(e0)

      and similarly for different numbers of arguments.

      GROUP_ID() is equivalent to GROUPING_ID(x, y, z), where x, y, z are the expressions being grouped.

        Activity

        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Julian Hyde, I have implemented the GROUPING__ID function using Calcite for HIVE-8988 (patch ready but not applied till HIVE-8974 goes through). I can help with this if you give me some hints about the sources that need to be adapted.

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , I have implemented the GROUPING__ID function using Calcite for HIVE-8988 (patch ready but not applied till HIVE-8974 goes through). I can help with this if you give me some hints about the sources that need to be adapted.
        Hide
        julianhyde Julian Hyde added a comment -

        Thanks for the help. I think it's fairly straightforward, building on work I did for GROUPING(e) - I abstracted a base class from SqlGroupingFunction and it has most of the functionality required to validate these new functions.

        It's a shame, and confusing, that Hive's functions are not compatible with Oracle's. Hive's GROUPING__ID() (why two underscores?!) is similar to Oracle's GROUP_ID(), but there's nothing like Oracle's GROUPING_ID(e [, e]...).

        Show
        julianhyde Julian Hyde added a comment - Thanks for the help. I think it's fairly straightforward, building on work I did for GROUPING(e) - I abstracted a base class from SqlGroupingFunction and it has most of the functionality required to validate these new functions. It's a shame, and confusing, that Hive's functions are not compatible with Oracle's. Hive's GROUPING__ID() (why two underscores?!) is similar to Oracle's GROUP_ID() , but there's nothing like Oracle's GROUPING_ID(e [, e] ...) .
        Hide
        julianhyde Julian Hyde added a comment -

        I see you have assigned this to yourself. If you want to do this, you should start from my branch https://github.com/julianhyde/incubator-calcite/commit/f49276b6bab3115945f0952c4121c2dbd6da0a9e. Or assign it back to me.

        The parsing and validation is done; we just need to translate to an expression involving indicator columns. SqlToRelConverter.lookupAggregates is probably where the work needs to be done. Run JdbcTest.testRunAgg (which calls agg.oq) and make sure the results look good.

        GROUP_ID and GROUPING_ID should probably return a BIGINT, not INTEGER as at present. That will allow for 63 grouping expressions rather than 31.

        What do you think should be the behavior if an argument to GROUPING_ID occurs multiple times, or if arguments are not in the same order as in the GROUP BY clause? I think we already handle the case where an argument to GROUPING_ID is not being grouped.

        Show
        julianhyde Julian Hyde added a comment - I see you have assigned this to yourself. If you want to do this, you should start from my branch https://github.com/julianhyde/incubator-calcite/commit/f49276b6bab3115945f0952c4121c2dbd6da0a9e . Or assign it back to me. The parsing and validation is done; we just need to translate to an expression involving indicator columns. SqlToRelConverter.lookupAggregates is probably where the work needs to be done. Run JdbcTest.testRunAgg (which calls agg.oq) and make sure the results look good. GROUP_ID and GROUPING_ID should probably return a BIGINT, not INTEGER as at present. That will allow for 63 grouping expressions rather than 31. What do you think should be the behavior if an argument to GROUPING_ID occurs multiple times, or if arguments are not in the same order as in the GROUP BY clause? I think we already handle the case where an argument to GROUPING_ID is not being grouped.
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

        Sorry Julian Hyde, I thought you had not started working on this (I did not see the status change in the ticket), so I was going to take care of it. Then maybe it is easier that you finish it, and I continue working on CALCITE-461? Your call.

        About your questions:

        • If an argument to GROUPING_ID appears multiple times (I don't know why somebody would want that), I guess there are two approaches: (i) throw an error, or (ii) allow it (and maybe throw a warning?). I support the second approach, so the user could have freedom to input any of the grouping columns (repeated or in any order) to the function.
        • I think for the GROUPING_ID function, we should only take into account its parameters and their order. Thus, as you stated:
          GROUPING_ID(e2, e1, e0)
          == GROUPING(e2) * 4 
          + GROUPING(e1) * 2
          + GROUPING(e0)
          

          independently of whether we have grouped by e.g. e2, e1, e0 or e1, e2, e0.

        I think we already handle the case where an argument to GROUPING_ID is not being grouped.

        What do you mean? Do we throw an error, or...?

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited Sorry Julian Hyde , I thought you had not started working on this (I did not see the status change in the ticket), so I was going to take care of it. Then maybe it is easier that you finish it, and I continue working on CALCITE-461 ? Your call. About your questions: If an argument to GROUPING_ID appears multiple times (I don't know why somebody would want that), I guess there are two approaches: (i) throw an error, or (ii) allow it (and maybe throw a warning?). I support the second approach, so the user could have freedom to input any of the grouping columns (repeated or in any order) to the function. I think for the GROUPING_ID function, we should only take into account its parameters and their order. Thus, as you stated: GROUPING_ID(e2, e1, e0) == GROUPING(e2) * 4 + GROUPING(e1) * 2 + GROUPING(e0) independently of whether we have grouped by e.g. e2, e1, e0 or e1, e2, e0 . I think we already handle the case where an argument to GROUPING_ID is not being grouped. What do you mean? Do we throw an error, or...?
        Hide
        vladimirsitnikov Vladimir Sitnikov added a comment -

        ndependently of whether we have grouped by e.g. e2, e1, e0 or e1, e2, e0.

        +1

        If an argument to GROUPING_ID appears multiple times

        Can you give a sample query for this case?

        Show
        vladimirsitnikov Vladimir Sitnikov added a comment - ndependently of whether we have grouped by e.g. e2, e1, e0 or e1, e2, e0. +1 If an argument to GROUPING_ID appears multiple times Can you give a sample query for this case?
        Hide
        julianhyde Julian Hyde added a comment -

        Here's an example of multiple expressions:

        select deptno, gender, grouping_id(deptno, gender, deptno), count(*)
        from emp
        group by cube(gender, deptno)
        

        If a row represents a rollup of deptno but not gender then grouping_id(deptno, gender, deptno) evaluates to 5 (binary 101).

        Show
        julianhyde Julian Hyde added a comment - Here's an example of multiple expressions: select deptno, gender, grouping_id(deptno, gender, deptno), count(*) from emp group by cube(gender, deptno) If a row represents a rollup of deptno but not gender then grouping_id(deptno, gender, deptno) evaluates to 5 (binary 101).
        Hide
        vladimirsitnikov Vladimir Sitnikov added a comment -

        I see no reason to fail in this case.
        I think those queries can easily come out of ORM.

        Show
        vladimirsitnikov Vladimir Sitnikov added a comment - I see no reason to fail in this case. I think those queries can easily come out of ORM.
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/incubator-calcite/commit/f2a67050 .
        Hide
        julianhyde Julian Hyde added a comment -

        Closing now that 1.0.0-incubating has been released.

        Show
        julianhyde Julian Hyde added a comment - Closing now that 1.0.0-incubating has been released.

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            julianhyde Julian Hyde
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development