Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-2051

Rules using Aggregate might check for simple grouping sets incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15.0
    • Component/s: core
    • Labels:
      None

      Description

      CALCITE-1069 removed the indicator columns for Aggregate operators. In some places, the indicator boolean check was replaced by the following check: aggregate.getGroupSets().size() > 1. However, that check is incomplete, it should have been replaced by aggregate.getGroupType() != Group.SIMPLE.

      For instance : https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/rules/AggregateProjectMergeRule.java#L91

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.15.0 (2017-12-11).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.15.0 (2017-12-11).
          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/75e838fab . Follow-up in CALCITE-2062 .
          Hide
          julianhyde Julian Hyde added a comment -

          +1 to commit your PR

          Can you log a follow-up case to apply my patch (adding a precondition) after 1.15, and we can re-open the discussion then? You haven't convinced me that there are any use cases for extra "expressive power" of Hive's operator, and unused expressive power places a heavy tax on all users of the algebra.

          Show
          julianhyde Julian Hyde added a comment - +1 to commit your PR Can you log a follow-up case to apply my patch (adding a precondition) after 1.15, and we can re-open the discussion then? You haven't convinced me that there are any use cases for extra "expressive power" of Hive's operator, and unused expressive power places a heavy tax on all users of the algebra.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Do you acknowledge that it was a mistake for Hive to diverge from the standard and have a dual GROUP BY and GROUPING SETS clause?

          It was a mistake indeed, that is out of question. The standard should have been followed from the onset. To be clear, when I started working on Hive, it was already there.

          I just meant that the operator can be more flexible, we do not need to have a one-to-one correspondence between the SQL clause and the operator. Calcite currently does not exploit this flexibility at the algebraic representation level, but Hive does. This will only be relevant for projects that integrate at the algebraic level and do not use Calcite parser too.

          However, if it is more convenient for the developer that the operator has exactly the same semantics as in SQL (less confusion), I am fine with that. I am just asking to enforce this at a new major release 2.0 instead of in 1.15. Maybe not adding the pre-condition was a bug from your perspective, but it gave us the expressive power needed to represent directly the semantics of the Hive group by operator.

          If that is not possible, to lower the impact, I guess we can incorporate my patch to yours, i.e., take action in the rules depending on the Group type (SIMPLE), and include the induce method in the Aggregate class so it can be overridden. That would give us a way to represent the semantics on our end without having the make changes to Hive parser for the time being.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Do you acknowledge that it was a mistake for Hive to diverge from the standard and have a dual GROUP BY and GROUPING SETS clause? It was a mistake indeed, that is out of question. The standard should have been followed from the onset. To be clear, when I started working on Hive, it was already there. I just meant that the operator can be more flexible, we do not need to have a one-to-one correspondence between the SQL clause and the operator. Calcite currently does not exploit this flexibility at the algebraic representation level, but Hive does. This will only be relevant for projects that integrate at the algebraic level and do not use Calcite parser too. However, if it is more convenient for the developer that the operator has exactly the same semantics as in SQL (less confusion), I am fine with that. I am just asking to enforce this at a new major release 2.0 instead of in 1.15. Maybe not adding the pre-condition was a bug from your perspective, but it gave us the expressive power needed to represent directly the semantics of the Hive group by operator. If that is not possible, to lower the impact, I guess we can incorporate my patch to yours, i.e., take action in the rules depending on the Group type (SIMPLE), and include the induce method in the Aggregate class so it can be overridden. That would give us a way to represent the semantics on our end without having the make changes to Hive parser for the time being.
          Hide
          julianhyde Julian Hyde added a comment -

          By "semantics" I mean what the SQL statement returns to the end user, not from the perspective of the Hive developer. As you point out, the bits returned by GROUPING__ID function will change, because it is driven by the GROUP BY clause. But other than that, paring down the GROUP BY clause has no effect on semantics.

          Do you acknowledge that it was a mistake for Hive to diverge from the standard and have a dual GROUP BY and GROUPING SETS clause?

          I can work with you on lowering the impact on Hive developers, but I want to establish that developer convenience it what is at issue here, not loss of expressive power.

          Show
          julianhyde Julian Hyde added a comment - By "semantics" I mean what the SQL statement returns to the end user, not from the perspective of the Hive developer. As you point out, the bits returned by GROUPING__ID function will change, because it is driven by the GROUP BY clause. But other than that, paring down the GROUP BY clause has no effect on semantics. Do you acknowledge that it was a mistake for Hive to diverge from the standard and have a dual GROUP BY and GROUPING SETS clause? I can work with you on lowering the impact on Hive developers, but I want to establish that developer convenience it what is at issue here, not loss of expressive power.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          I do not believe so. In Hive, given "GROUP BY c1, c2, ... GROUPING SETS ((c11, c12, ), (c21, c22, ...), ...)", AFAICT, you can always remove the "c1, c2, ..." with no change in semantics.

          Removing the group by clause and preserving semantics would imply a rewriting as the one described in my comment above, there is a change in Semantics. In addition, the columns in each grouping set need to be a subset of the columns in the group by set. For instance, GROUP BY c1, c2, c3 GROUPING SETS ((c1), (c3)) is valid too.

          The restriction is enforceable, you can rewrite the expression into an equivalent one, but it is not as simple as writing a union of the grouping sets when we go into Calcite. In addition, we would need to rewrite, i.e., apply an additional function, functions such as groupingId to return the expected value.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - I do not believe so. In Hive, given "GROUP BY c1, c2, ... GROUPING SETS ((c11, c12, ), (c21, c22, ...), ...)", AFAICT, you can always remove the "c1, c2, ..." with no change in semantics. Removing the group by clause and preserving semantics would imply a rewriting as the one described in my comment above, there is a change in Semantics. In addition, the columns in each grouping set need to be a subset of the columns in the group by set. For instance, GROUP BY c1, c2, c3 GROUPING SETS ((c1), (c3)) is valid too. The restriction is enforceable, you can rewrite the expression into an equivalent one, but it is not as simple as writing a union of the grouping sets when we go into Calcite. In addition, we would need to rewrite, i.e., apply an additional function, functions such as groupingId to return the expected value.
          Hide
          julianhyde Julian Hyde added a comment -

          The operator could be more expressive than SQL

          I do not believe so. In Hive, given "GROUP BY c1, c2, ... GROUPING SETS ((c11, c12, ), (c21, c22, ...), ...)", AFAICT, you can always remove the "c1, c2, ..." with no change in semantics.

          When Hive calls Calcite to create an Aggregate with groupSet and groupSets arguments, if groupSets is not null it is safe to assign groupSet = ImmutableSet.union(groupSets) before calling. This is a straightforward change and I doubt it would set back the release.

          I'm sorry we didn't have the pre-condition in earlier. But because standard SQL doesn't allow specifying both GROUP BY and GROUPING SETS, I never imagined anyone would specify groupSet as anything other than the union of groupSets, as there is no point in doing so.

          Show
          julianhyde Julian Hyde added a comment - The operator could be more expressive than SQL I do not believe so. In Hive, given "GROUP BY c1, c2, ... GROUPING SETS ((c11, c12, ), (c21, c22, ...), ...)", AFAICT, you can always remove the "c1, c2, ..." with no change in semantics. When Hive calls Calcite to create an Aggregate with groupSet and groupSets arguments, if groupSets is not null it is safe to assign groupSet = ImmutableSet.union(groupSets) before calling. This is a straightforward change and I doubt it would set back the release. I'm sorry we didn't have the pre-condition in earlier. But because standard SQL doesn't allow specifying both GROUP BY and GROUPING SETS, I never imagined anyone would specify groupSet as anything other than the union of groupSets, as there is no point in doing so.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, we can make Hive respect this, but we would need to change the logic in the parser that creates the Calcite plan, including some workaround for functions such as grouping id. With this change, till we do that, we will not be able to upgrade to 1.15, as we will hit the Precondition when we instantiate the operator.
          The operator could be more expressive than SQL, I am not convinced we need to enforce this condition. However, if we do, I think we should hold the fix till we have a new major release, since these semantics for the Aggregate operator are more restrictive than the existing ones, and we would break compatibility with upstream projects, e.g, Hive.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , we can make Hive respect this, but we would need to change the logic in the parser that creates the Calcite plan, including some workaround for functions such as grouping id. With this change, till we do that, we will not be able to upgrade to 1.15, as we will hit the Precondition when we instantiate the operator. The operator could be more expressive than SQL, I am not convinced we need to enforce this condition. However, if we do, I think we should hold the fix till we have a new major release, since these semantics for the Aggregate operator are more restrictive than the existing ones, and we would break compatibility with upstream projects, e.g, Hive.
          Show
          julianhyde Julian Hyde added a comment - Jesus Camacho Rodriguez , Please see proposed fix in https://github.com/julianhyde/calcite/tree/2051-minimal-groupSet .
          Hide
          julianhyde Julian Hyde added a comment -

          I think it was a mistake for Calcite to allow such relational expressions. Ideally it should reject an Aggregate if there are columns that are not in any of the grouping sets:

          diff --git a/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java b/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java
          index 93f9948..d76c504 100644
          --- a/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java
          +++ b/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java
          @@ -156,6 +156,9 @@ protected Aggregate(
                   assert groupSet.contains(set);
                 }
               }
          +    Preconditions.checkArgument(
          +        groupSet.equals(ImmutableBitSet.union(this.groupSets)),
          +        "groupSet must be minimal", groupSet, groupSets);
               assert groupSet.length() <= child.getRowType().getFieldCount();
               for (AggregateCall aggCall : aggCalls) {
                 assert typeMatchesInferred(aggCall, Litmus.THROW);
          

          (I was assuming that when I defined SIMPLE, I just forgot to check the pre-condition.) If we were to make that change, can we change Hive to respect it?

          Show
          julianhyde Julian Hyde added a comment - I think it was a mistake for Calcite to allow such relational expressions. Ideally it should reject an Aggregate if there are columns that are not in any of the grouping sets: diff --git a/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java b/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java index 93f9948..d76c504 100644 --- a/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java +++ b/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java @@ -156,6 +156,9 @@ protected Aggregate( assert groupSet.contains(set); } } + Preconditions.checkArgument( + groupSet.equals(ImmutableBitSet.union(this.groupSets)), + "groupSet must be minimal", groupSet, groupSets); assert groupSet.length() <= child.getRowType().getFieldCount(); for (AggregateCall aggCall : aggCalls) { assert typeMatchesInferred(aggCall, Litmus.THROW); (I was assuming that when I defined SIMPLE, I just forgot to check the pre-condition.) If we were to make that change, can we change Hive to respect it?
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -
          select a, b from t group by (a, b) grouping sets (a)
          

          would be equivalent to

          select a, null as "b" from t group by (a)
          

          We could translate it that way when we go into Calcite from Hive, though currently we do not. The operator implementation can represent the semantics expressed above, as it does not impose that the only set in groupingSets has to be equal to groupSets. Thus, my suggestion was to use type SIMPLE (groupSet = groupingSets) for the check.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - select a, b from t group by (a, b) grouping sets (a) would be equivalent to select a, null as "b" from t group by (a) We could translate it that way when we go into Calcite from Hive, though currently we do not. The operator implementation can represent the semantics expressed above, as it does not impose that the only set in groupingSets has to be equal to groupSets. Thus, my suggestion was to use type SIMPLE (groupSet = groupingSets) for the check.
          Hide
          julianhyde Julian Hyde added a comment -

          I still don't see why Hive is generating wacky relational algebra.

          Is it true that

          select ... from t group by (a, b) grouping sets (a)
          

          is equivalent to

          select ... from t group by grouping sets (a)
          

          and if so, should Hive not translate query 1 the same as query 2?

          Show
          julianhyde Julian Hyde added a comment - I still don't see why Hive is generating wacky relational algebra. Is it true that select ... from t group by (a, b) grouping sets (a) is equivalent to select ... from t group by grouping sets (a) and if so, should Hive not translate query 1 the same as query 2?
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

          Julian Hyde, I have created a PR with the changes to the rules that contain that same snippet:
          https://github.com/apache/calcite/pull/566

          We were hitting the issue in Hive with a query ... group by (a, b) grouping sets (a) ... and rule AggregateProjectMergeRule. Thus, you end up with grouping sets with a single set that is not equal to the set in the group by clause. This is possible in Hive, but I think it is not standard SQL, and I have not been able to reproduce it in Calcite. The induce method in Aggregate class takes this into account:
          https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java#L470
          That is the reason why this change fixes the issue.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited Julian Hyde , I have created a PR with the changes to the rules that contain that same snippet: https://github.com/apache/calcite/pull/566 We were hitting the issue in Hive with a query ... group by (a, b) grouping sets (a) ... and rule AggregateProjectMergeRule . Thus, you end up with grouping sets with a single set that is not equal to the set in the group by clause. This is possible in Hive, but I think it is not standard SQL, and I have not been able to reproduce it in Calcite. The induce method in Aggregate class takes this into account: https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java#L470 That is the reason why this change fixes the issue.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

          Fwiw, the query was grouping by a set of columns and it contained a unique grouping set with a subset of the grouping columns.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited Fwiw, the query was grouping by a set of columns and it contained a unique grouping set with a subset of the grouping columns.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, I had some example in Hive, I need to check whether I can reproduce the issue in Calcite with the same query.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , I had some example in Hive, I need to check whether I can reproduce the issue in Calcite with the same query.
          Hide
          julianhyde Julian Hyde added a comment -

          Is there a SQL query that illustrates this problem?

          Show
          julianhyde Julian Hyde added a comment - Is there a SQL query that illustrates this problem?

            People

            • Assignee:
              jcamachorodriguez Jesus Camacho Rodriguez
              Reporter:
              jcamachorodriguez Jesus Camacho Rodriguez
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development