Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1984

Incorrect rewriting with materialized views using DISTINCT in aggregate functions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13.0
    • Fix Version/s: 1.15.0
    • Component/s: core
    • Labels:
      None

      Description

      I had written a test as follows . It runs to succeed . But , the result seems not right . The “count distinct” aggregation should not be changed to “count”.

      @Test public void testAggregateMaterializationOnCountDistinctQuery1() {
          checkMaterialize(
              "select \"deptno\",  \"empid\" ,\"salary\"\n"
                  + "from \"emps\" group by \"deptno\",  \"empid\",\"salary\"",
              "select \"deptno\", count( distinct \"empid\")" + "from (select \"deptno\",  \"empid\" \n"
                  + "from \"emps\" group by \"deptno\",  \"empid\")  group by \"deptno\"",
              HR_FKUK_MODEL,
              CalciteAssert.checkResultContains(
                  "EnumerableAggregate(group=[{0}], S=[COUNT($1)])\n" +
                      "  EnumerableTableScan(table=[[hr, m0]]"));
        }
      

        Activity

        Hide
        julianhyde Julian Hyde added a comment -

        Resolved in release 1.15.0 (2017-12-11).

        Show
        julianhyde Julian Hyde added a comment - Resolved in release 1.15.0 (2017-12-11).
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited
        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/20eee6313 . Thanks for reporting fei.chen .
        Hide
        chenfeisd@163.com fei.chen added a comment - - edited

        Julian Hyde , Jesus Camacho Rodriguez thanks a lot.

        Show
        chenfeisd@163.com fei.chen added a comment - - edited Julian Hyde , Jesus Camacho Rodriguez thanks a lot.
        Hide
        julianhyde Julian Hyde added a comment -

        That optimization should already happen, if you use the current rewrite rule. It pushes down "group by deptno, empid", which is discovered to be trivial and removed. It's worth adding a test case to make sure this happens.

        Show
        julianhyde Julian Hyde added a comment - That optimization should already happen, if you use the current rewrite rule. It pushes down "group by deptno, empid", which is discovered to be trivial and removed. It's worth adding a test case to make sure this happens.
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Sure, I will add that testcase too, thanks.

        There is another angle here btw: empid is a primary key, thus we could potentially generate the rewriting without the distinct by exploiting that information, implementing a rule that removes the distinct if the column does not have duplicate values. I will fill in another JIRA for that.

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Sure, I will add that testcase too, thanks. There is another angle here btw: empid is a primary key, thus we could potentially generate the rewriting without the distinct by exploiting that information, implementing a rule that removes the distinct if the column does not have duplicate values. I will fill in another JIRA for that.
        Hide
        julianhyde Julian Hyde added a comment -

        After your fix, can you make sure that

        select t1.deptno, count(t1.empid)
        from (
          select deptno, empid
          from emps
          group by deptno, empid) as t1
        group by t1.deptno;

        (same as your query, but no distinct) gets the same rewrite, because it equivalent to your query.

        Show
        julianhyde Julian Hyde added a comment - After your fix, can you make sure that select t1.deptno, count(t1.empid) from ( select deptno, empid from emps group by deptno, empid) as t1 group by t1.deptno; (same as your query, but no distinct ) gets the same rewrite, because it equivalent to your query.
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Yes, I agree, but the rewriting is indeed incorrect.

        MV:

        select deptno, empid, salary
        from emps
        group by deptno, empid, salary;
        

        Query:

        select t1.deptno, count(distinct t1.empid)
        from (
          select deptno, empid
          from emps
          group by deptno, empid) as t1
        group by t1.deptno;
        

        Produced rewrite (currently):

        select deptno, count(empid)
        from mv
        group by deptno;
        

        Correct rewrite:

        select deptno, count(distinct empid)
        from mv
        group by deptno;
        
        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Yes, I agree, but the rewriting is indeed incorrect. MV: select deptno, empid, salary from emps group by deptno, empid, salary; Query: select t1.deptno, count(distinct t1.empid) from ( select deptno, empid from emps group by deptno, empid) as t1 group by t1.deptno; Produced rewrite (currently): select deptno, count(empid) from mv group by deptno; Correct rewrite: select deptno, count(distinct empid) from mv group by deptno;
        Hide
        julianhyde Julian Hyde added a comment -

        Yes, I saw that. But the query's input has grouped on (deptno, empid) and therefore the distinct is superfluous.

        Show
        julianhyde Julian Hyde added a comment - Yes, I saw that. But the query's input has grouped on (deptno, empid) and therefore the distinct is superfluous.
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        Julian Hyde, observe the salary column in the group by clause in the view.

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , observe the salary column in the group by clause in the view.
        Hide
        julianhyde Julian Hyde added a comment -

        Not convinced - I claim that

         select t1.deptno,count(distinct t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno;

        is equivalent to

         select t1.deptno,count(t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno;
        Show
        julianhyde Julian Hyde added a comment - Not convinced - I claim that select t1.deptno,count(distinct t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno; is equivalent to select t1.deptno,count(t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno;
        Hide
        jcamachorodriguez Jesus Camacho Rodriguez added a comment -

        fei.chen, Julian Hyde, I will take a look at this.

        Show
        jcamachorodriguez Jesus Camacho Rodriguez added a comment - fei.chen , Julian Hyde , I will take a look at this.
        Hide
        chenfeisd@163.com fei.chen added a comment -

        postgres=#
        postgres=#
        postgres=# create table emps(deptno varchar(20),empid varchar(20),salary int);
        CREATE TABLE
        postgres=# insert into emps values('d1','e1',11);
        INSERT 0 1
        postgres=# insert into emps values('d1','e1',12);
        INSERT 0 1
        postgres=# create view mo as select deptno,empid,salary from emps group by deptno,empid,salary;
        CREATE VIEW
        postgres=# select t1.deptno,count(distinct t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno;
        deptno | count
        -------+------
        d1 | 1
        (1 row)

        postgres=# select deptno,count(empid) from mo group by deptno;
        deptno | count
        -------+------
        d1 | 2
        (1 row)

        postgres=#

        Show
        chenfeisd@163.com fei.chen added a comment - postgres=# postgres=# postgres=# create table emps(deptno varchar(20),empid varchar(20),salary int); CREATE TABLE postgres=# insert into emps values('d1','e1',11); INSERT 0 1 postgres=# insert into emps values('d1','e1',12); INSERT 0 1 postgres=# create view mo as select deptno,empid,salary from emps group by deptno,empid,salary; CREATE VIEW postgres=# select t1.deptno,count(distinct t1.empid) from (select deptno,empid from emps group by deptno,empid) as t1 group by t1.deptno; deptno | count ------- + ------ d1 | 1 (1 row) postgres=# select deptno,count(empid) from mo group by deptno; deptno | count ------- + ------ d1 | 2 (1 row) postgres=#
        Hide
        julianhyde Julian Hyde added a comment -

        The rewrite looks valid to me, because empid is already unique for any given deptno value, due to the group by on the input.

        If you disagree, see if you can get the query to produce a different result than say postgreSQL or oracle.

        Show
        julianhyde Julian Hyde added a comment - The rewrite looks valid to me, because empid is already unique for any given deptno value, due to the group by on the input. If you disagree, see if you can get the query to produce a different result than say postgreSQL or oracle.

          People

          • Assignee:
            jcamachorodriguez Jesus Camacho Rodriguez
            Reporter:
            chenfeisd@163.com fei.chen
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development