Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1930

AggregateExpandDistinctAggregatesRule does not expand aggregates properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: core
    • Labels:
      None

      Description

      A query with two aggregate calls, sum(comm) and min(comm), it produces an incorrect plan. The main problem is that it fails to index the input of sum(comm) and min(comm) properly. This seems to only happen in the special case where there is only one distinct aggregate call.

      SELECT deptno, sum(comm), min(comm), SUM(DISTINCT sal) FROM emp GROUP BY deptno
      

      AggregateExpandDistinctAggregatesRule produces the following plan in this case.

      LogicalAggregate(group=[{0}], EXPR$1=[SUM($3)], EXPR$2=[MIN($3)], EXPR$3=[SUM($1)])
        LogicalAggregate(group=[{0, 2}], EXPR$1=[SUM($1)], EXPR$2=[MIN($1)])
          LogicalProject(DEPTNO=[$7], COMM=[$6], SAL=[$5])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      

      In the above plan, the top LogicalAggregate EXPR$1=[SUM($3]] is incorrect, it should be SUM($2).

        Activity

        Hide
        minjikim MinJi Kim added a comment -

        Here is an initial fix for this issue that I noticed for converting a single distinct aggregate call. Please let me know if you have any comments or feedback. Thanks!

        https://github.com/apache/calcite/pull/510

        Show
        minjikim MinJi Kim added a comment - Here is an initial fix for this issue that I noticed for converting a single distinct aggregate call. Please let me know if you have any comments or feedback. Thanks! https://github.com/apache/calcite/pull/510
        Hide
        julianhyde Julian Hyde added a comment -

        Thanks for finding and fixing this. A few comments:

        • How crucial is the assumption that there's only one argument? Could we drop it? Is it a new assumption, or did it already exist in the previous code? The standard aggregate functions happen to have one argument, but real ones may have many. Calcite has COUNT(x, y, z) as a built-in. So it would be really useful if we dropped that assumption.
        • Can you please add some tests to agg.iq? Sometimes we don't notice that rule are producing the wrong relational algebra but you can't argue if the results are wrong.
        Show
        julianhyde Julian Hyde added a comment - Thanks for finding and fixing this. A few comments: How crucial is the assumption that there's only one argument? Could we drop it? Is it a new assumption, or did it already exist in the previous code? The standard aggregate functions happen to have one argument, but real ones may have many. Calcite has COUNT(x, y, z) as a built-in. So it would be really useful if we dropped that assumption. Can you please add some tests to agg.iq? Sometimes we don't notice that rule are producing the wrong relational algebra but you can't argue if the results are wrong.
        Hide
        minjikim MinJi Kim added a comment -

        The assumption that the "distinct aggregate call" have only one argument was always there for converting the single distinct call, but never explicitly stated. So I added it to make it clear. I don't think it is inherently necessary, although will need some work to fix. To understand this properly, if someone uses COUNT(distinct x, y, z), it would be counting distinct combination of x, y, z? So, two rows with the following (x, y, z) values – (1, 1, 1), (1, 1, 2) – would be counted separately? I can try to drop that assumption in the next patch. Let me know if I have not understood the problem properly here.

        I will also add some queries to agg.iq in the next patch.

        Show
        minjikim MinJi Kim added a comment - The assumption that the "distinct aggregate call" have only one argument was always there for converting the single distinct call, but never explicitly stated. So I added it to make it clear. I don't think it is inherently necessary, although will need some work to fix. To understand this properly, if someone uses COUNT(distinct x, y, z), it would be counting distinct combination of x, y, z? So, two rows with the following (x, y, z) values – (1, 1, 1), (1, 1, 2) – would be counted separately? I can try to drop that assumption in the next patch. Let me know if I have not understood the problem properly here. I will also add some queries to agg.iq in the next patch.
        Hide
        julianhyde Julian Hyde added a comment -

        Yes (1,1,1) and (1,1,2) will be counted separately. SELECT COUNT(DISTINCT x, y, z) FROM t is identical to SELECT COUNT(x, y, z) FROM (SELECT DISTINCT x, y, z FROM t). And COUNT counts the number of rows for which x, y and z are all not null.

        Can you find a query that has "multiple arguments" and therefore hits this issue? You don't need to fix it, just add it to agg.iq either disabled or giving the wrong answer.

        Show
        julianhyde Julian Hyde added a comment - Yes (1,1,1) and (1,1,2) will be counted separately. SELECT COUNT(DISTINCT x, y, z) FROM t is identical to SELECT COUNT(x, y, z) FROM (SELECT DISTINCT x, y, z FROM t) . And COUNT counts the number of rows for which x, y and z are all not null. Can you find a query that has "multiple arguments" and therefore hits this issue? You don't need to fix it, just add it to agg.iq either disabled or giving the wrong answer.
        Hide
        minjikim MinJi Kim added a comment - - edited

        Here is the updated PR: https://github.com/apache/calcite/pull/510

        I added tests to agg.iq. Also, the changes to the AggregateExpandDistinctAggregatesRule to support distinct aggregate calls with more than one input were not that big, so I made them in this patch. I think it seems to work fine now, so I added tests in both RelOptRulesTest and agg.iq with COUNT(DISTINCT A, B). Thanks!

        Show
        minjikim MinJi Kim added a comment - - edited Here is the updated PR: https://github.com/apache/calcite/pull/510 I added tests to agg.iq. Also, the changes to the AggregateExpandDistinctAggregatesRule to support distinct aggregate calls with more than one input were not that big, so I made them in this patch. I think it seems to work fine now, so I added tests in both RelOptRulesTest and agg.iq with COUNT(DISTINCT A, B). Thanks!
        Hide
        julianhyde Julian Hyde added a comment -

        Reviewing and testing now.

        Show
        julianhyde Julian Hyde added a comment - Reviewing and testing now.
        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/7321c870 .
        Hide
        michaelmior Michael Mior added a comment -

        Resolved in release 1.14.0 (2017-10-01)

        Show
        michaelmior Michael Mior added a comment - Resolved in release 1.14.0 (2017-10-01)

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            minjikim MinJi Kim
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development