Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1493

Wrong plan for NOT IN correlated queries

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11.0
    • Component/s: core
    • Labels:

      Description

      Plan generated by calcite with SubqueryRemoveRule followed by de-correlation for the following query:

       select sal from emp where empno NOT IN (select deptno from dept where emp.job = dept.name) 

      is

      LogicalProject(SAL=[$5])
        LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
          LogicalFilter(condition=[IS NULL($11)])
            LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], DEPTNO0=[CAST($9):INTEGER], JOB0=[CAST($10):VARCHAR(10) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"], $f2=[CAST($11):BOOLEAN])
              LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[inner])
                LogicalTableScan(table=[[CATALOG, SALES, EMP]])
                LogicalProject(DEPTNO=[$0], JOB=[$1], $f2=[true])
                  LogicalAggregate(group=[{0, 1}])
                    LogicalProject(DEPTNO=[$0], JOB=[$2], i=[$1])
                      LogicalProject(DEPTNO=[$0], i=[true], JOB=[$1])
                        LogicalProject(DEPTNO=[$0], JOB=[$2])
                          LogicalJoin(condition=[=($2, $1)], joinType=[inner])
                            LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
                            LogicalAggregate(group=[{0}])
                              LogicalProject(JOB=[$2])
                                LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      

      As you can notice instead of doing Left Outer Join Calcite is doing Inner Join. This will produce wrong results.

      Plan for same query just before SubqueryRemove Rule is:

      LogicalProject(SAL=[$5])
        LogicalFilter(condition=[NOT(IN($0, {
      LogicalProject(DEPTNO=[$0])
        LogicalFilter(condition=[=($cor0.JOB, $1)])
          LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
      }))], variablesSet=[[$cor0]])
          LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      

      Plan just after SubqueryRemove Rule:

      LogicalProject(SAL=[$5])
        LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
          LogicalFilter(condition=[IS NULL($10)])
            LogicalFilter(condition=[=($0, $9)])
              LogicalCorrelate(correlation=[$cor0], joinType=[LEFT], requiredColumns=[{2}])
                LogicalTableScan(table=[[CATALOG, SALES, EMP]])
                LogicalAggregate(group=[{0, 1}])
                  LogicalProject(DEPTNO=[$0], i=[true])
                    LogicalProject(DEPTNO=[$0])
                      LogicalFilter(condition=[=($cor0.JOB, $1)])
                        LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
      

      Looking at above it seems RelDecorrelator have some issue where it is coming up with Inner Join.

        Activity

        Hide
        vgarg Vineet Garg added a comment - - edited

        Note that I had to hack RelOptTestBase's checkPlanning method to first generate the plan, apply SubqueryRemove rule and then immediately do decorrelation using RelDecorrelator.decorrelateQuery to reproduce above plans. (This is exactly the same way Hive plans to use Calcite for processing subqueries).
        I tried adding test in SqlToRelConverterTest to reproduce this but it seems to bypass SubqueryRemove Rule.

        Show
        vgarg Vineet Garg added a comment - - edited Note that I had to hack RelOptTestBase's checkPlanning method to first generate the plan, apply SubqueryRemove rule and then immediately do decorrelation using RelDecorrelator.decorrelateQuery to reproduce above plans. (This is exactly the same way Hive plans to use Calcite for processing subqueries). I tried adding test in SqlToRelConverterTest to reproduce this but it seems to bypass SubqueryRemove Rule.
        Hide
        vgarg Vineet Garg added a comment -

        I notice that the plan just after SubqueryRemoveRule consists of LogicalCorrelate with LEFT joinType and a LogicalFilter corresponding to WHERE IN subquery column on top of it. While de-correlating this LogicalCorrelate decorrelation logic appropriately creates LEFT JOIN with condition as conjunction of co-related variables, but it leaves LogicalFilter on top of it. After de-correlation decorrelateRel calls findBestExp where in this LogicalFilter + LogicalJoin(LEFT) is converted to INNER JOIN.

        I believe creating LogicalFilter on top of LogicalCorrelate is wrong. If LogicalCorrelate is of type LEFT then this LogicalFilter should rather be a part of LogicalCorrelate as a condition. (Although it looks like current design doesn't allow a condition with LogicalCorrelate ).

        Julian Hyde Any thoughts on this ?

        Show
        vgarg Vineet Garg added a comment - I notice that the plan just after SubqueryRemoveRule consists of LogicalCorrelate with LEFT joinType and a LogicalFilter corresponding to WHERE IN subquery column on top of it. While de-correlating this LogicalCorrelate decorrelation logic appropriately creates LEFT JOIN with condition as conjunction of co-related variables, but it leaves LogicalFilter on top of it. After de-correlation decorrelateRel calls findBestExp where in this LogicalFilter + LogicalJoin(LEFT) is converted to INNER JOIN. I believe creating LogicalFilter on top of LogicalCorrelate is wrong. If LogicalCorrelate is of type LEFT then this LogicalFilter should rather be a part of LogicalCorrelate as a condition. (Although it looks like current design doesn't allow a condition with LogicalCorrelate ). Julian Hyde Any thoughts on this ?
        Hide
        julianhyde Julian Hyde added a comment -

        Don't use checkPlanning. You get more flexibility if you use sql(...) (the fluent API). The fluent API can be extended if necessary.

        Not surprised that SqlToRelConverterTest skips SubqueryRemoveRule; it doesn't do any planning.

        Show
        julianhyde Julian Hyde added a comment - Don't use checkPlanning . You get more flexibility if you use sql(...) (the fluent API). The fluent API can be extended if necessary. Not surprised that SqlToRelConverterTest skips SubqueryRemoveRule ; it doesn't do any planning.
        Hide
        julianhyde Julian Hyde added a comment -

        Is it absolutely necessary to put the filter condition inside Correlate? Could the Filter go under the Correlate? Then Correlate(joinType=LEFT) will generate a NULL row if the input is empty.

        Show
        julianhyde Julian Hyde added a comment - Is it absolutely necessary to put the filter condition inside Correlate? Could the Filter go under the Correlate? Then Correlate(joinType=LEFT) will generate a NULL row if the input is empty.
        Hide
        vgarg Vineet Garg added a comment -

        The filter condition is on subquery expression column e.g. in this case filter is

         emp.empno = dept.deptno 

        which can not go under the Correlate.

        Show
        vgarg Vineet Garg added a comment - The filter condition is on subquery expression column e.g. in this case filter is emp.empno = dept.deptno which can not go under the Correlate.
        Hide
        julianhyde Julian Hyde added a comment -

        Why? The right input to Correlate can see columns from its own input but also from the left input to Correlate (via the variable).

        Show
        julianhyde Julian Hyde added a comment - Why? The right input to Correlate can see columns from its own input but also from the left input to Correlate (via the variable).
        Hide
        vgarg Vineet Garg added a comment -

        Julian Hyde I am not sure if I fully understand that. I understand that Correlate's right input can see it's sibling's column through correlate but I am still not sure how the filter would look like under Correlate and how would it be handled during decorrelation ?

        Also the plan above doesn't look like it handles NULL cases given this is NOT IN query.

        Show
        vgarg Vineet Garg added a comment - Julian Hyde I am not sure if I fully understand that. I understand that Correlate's right input can see it's sibling's column through correlate but I am still not sure how the filter would look like under Correlate and how would it be handled during decorrelation ? Also the plan above doesn't look like it handles NULL cases given this is NOT IN query.
        Hide
        julianhyde Julian Hyde added a comment -

        First the null case. It does not need to handle null cases, because dept.deptno is mandatory.

        To be clear, I agree with you that something is wrong with the plan. I don't know what the full plan should be. In order to figure that out, I would have to stop what I am doing and fix the bug.

        But I am skeptical that the only solution is to add a filter to LogicalCorrelate.

        Show
        julianhyde Julian Hyde added a comment - First the null case. It does not need to handle null cases, because dept.deptno is mandatory. To be clear, I agree with you that something is wrong with the plan. I don't know what the full plan should be. In order to figure that out, I would have to stop what I am doing and fix the bug. But I am skeptical that the only solution is to add a filter to LogicalCorrelate.
        Hide
        julianhyde Julian Hyde added a comment -

        Vineet Garg Can you provide code that reproduces this? I just tried a similar query and hit CALCITE-1513.

        Show
        julianhyde Julian Hyde added a comment - Vineet Garg Can you provide code that reproduces this? I just tried a similar query and hit CALCITE-1513 .
        Hide
        vgarg Vineet Garg added a comment -

        Julian Hyde I updated RelOptRulesTest's checkPlanning method to do de-correlation immediately after SubQueryRemoveRule on query

         select sal from emp where empno NOT IN (select deptno from dept where emp.job = dept.name)  

        .
        Let me generate and upload a patch

        Show
        vgarg Vineet Garg added a comment - Julian Hyde I updated RelOptRulesTest's checkPlanning method to do de-correlation immediately after SubQueryRemoveRule on query select sal from emp where empno NOT IN (select deptno from dept where emp.job = dept.name) . Let me generate and upload a patch
        Hide
        vgarg Vineet Garg added a comment -

        I have uploaded test patch. Let me know if you are still not able to re-produce it. In RelOptTestBase's checkPlanning you should be able to look at relAfter & relAfterDecorrelate to get to plans

        Show
        vgarg Vineet Garg added a comment - I have uploaded test patch. Let me know if you are still not able to re-produce it. In RelOptTestBase's checkPlanning you should be able to look at relAfter & relAfterDecorrelate to get to plans
        Hide
        julianhyde Julian Hyde added a comment - - edited

        Your "Plan just after SubqueryRemove Rule" above is mangled because you made a mistake in RelOptTestBase:

            RelNode relAfter = planner.findBestExp();
            RelNode relAfterDecorrelate = RelDecorrelator.decorrelateQuery(relAfter);
        
            String planAfter = NL + RelOptUtil.toString(relAfter);
        

        You assign to relAfterDecorrelate but then you compare relAfter. At this point relAfter has been chewed up by the decorrelation process, and is a mess. relAfterDecorrelate is the following, which looks OK to me:

        LogicalProject(SAL=[$5])
          LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
            LogicalFilter(condition=[IS NULL($11)])
              LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], DEPTNO0=[CAST($9):INTEGER], JOB0=[CAST($10):VARCHAR(10) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"], $f2=[CAST($11):BOOLEAN])
                LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[inner])
                  LogicalTableScan(table=[[CATALOG, SALES, EMP]])
                  LogicalProject(DEPTNO=[$0], JOB=[$1], $f2=[true])
                    LogicalAggregate(group=[{0, 1}])
                      LogicalProject(DEPTNO=[$0], JOB=[$2], i=[$1])
                        LogicalProject(DEPTNO=[$0], i=[true], JOB=[$1])
                          LogicalProject(DEPTNO=[$0], JOB=[$2])
                            LogicalJoin(condition=[=($2, $1)], joinType=[inner])
                              LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
                              LogicalAggregate(group=[{0}])
                                LogicalProject(JOB=[$2])
                                  LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        
        Show
        julianhyde Julian Hyde added a comment - - edited Your "Plan just after SubqueryRemove Rule" above is mangled because you made a mistake in RelOptTestBase: RelNode relAfter = planner.findBestExp(); RelNode relAfterDecorrelate = RelDecorrelator.decorrelateQuery(relAfter); String planAfter = NL + RelOptUtil.toString(relAfter); You assign to relAfterDecorrelate but then you compare relAfter . At this point relAfter has been chewed up by the decorrelation process, and is a mess. relAfterDecorrelate is the following, which looks OK to me: LogicalProject(SAL=[$5]) LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8]) LogicalFilter(condition=[IS NULL($11)]) LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8], DEPTNO0=[CAST($9):INTEGER], JOB0=[CAST($10):VARCHAR(10) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary"], $f2=[CAST($11):BOOLEAN]) LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[inner]) LogicalTableScan(table=[[CATALOG, SALES, EMP]]) LogicalProject(DEPTNO=[$0], JOB=[$1], $f2=[true]) LogicalAggregate(group=[{0, 1}]) LogicalProject(DEPTNO=[$0], JOB=[$2], i=[$1]) LogicalProject(DEPTNO=[$0], i=[true], JOB=[$1]) LogicalProject(DEPTNO=[$0], JOB=[$2]) LogicalJoin(condition=[=($2, $1)], joinType=[inner]) LogicalTableScan(table=[[CATALOG, SALES, DEPT]]) LogicalAggregate(group=[{0}]) LogicalProject(JOB=[$2]) LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        Hide
        vgarg Vineet Garg added a comment - - edited

        I realized this plan is mangled. I wasn't really comparing relAfter. If you see the original description of this JIRA I was comparing relAfter and relAfterDecorrelate
        Issue I see with this plan is that

        LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[inner])

        is doing INNER JOIN instead of LEFT OUTER JOIN which will produce wrong results

        Show
        vgarg Vineet Garg added a comment - - edited I realized this plan is mangled. I wasn't really comparing relAfter . If you see the original description of this JIRA I was comparing relAfter and relAfterDecorrelate Issue I see with this plan is that LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[ inner ]) is doing INNER JOIN instead of LEFT OUTER JOIN which will produce wrong results
        Hide
        julianhyde Julian Hyde added a comment -

        Added Vineet Garg's test case in http://git-wip-us.apache.org/repos/asf/calcite/commit/bac9ee7c, and cleaned up test infrastructure so that we can control the point in a test where we decorrelate.

        Show
        julianhyde Julian Hyde added a comment - Added Vineet Garg 's test case in http://git-wip-us.apache.org/repos/asf/calcite/commit/bac9ee7c , and cleaned up test infrastructure so that we can control the point in a test where we decorrelate.
        Hide
        julianhyde Julian Hyde added a comment -
        Show
        julianhyde Julian Hyde added a comment - Preliminary fix in https://github.com/julianhyde/calcite/tree/1493-correl-plan . Vineet Garg , please take a look.
        Hide
        vgarg Vineet Garg added a comment -
        +    LogicalFilter(condition=[IS NULL($11)])
        +      LogicalJoin(condition=[AND(=($0, $12), =($2, $10))], joinType=[left])
        +        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        +        LogicalJoin(condition=[=($3, $0)], joinType=[inner]) <======================================Why is this required ?
        +          LogicalProject(DEPTNO=[$0], JOB=[$1], $f2=[true])
        +            LogicalAggregate(group=[{0, 1}])
        +              LogicalProject(DEPTNO=[$0], JOB=[$2], i=[$1])
        +                LogicalProject(DEPTNO=[$0], i=[true], JOB=[$1])
        +                  LogicalProject(DEPTNO=[$0], JOB=[$2])
        +                    LogicalJoin(condition=[=($2, $1)], joinType=[inner])
        +                      LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
        +                      LogicalAggregate(group=[{0}])
        +                        LogicalProject(JOB=[$2])
        +                          LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        +          LogicalAggregate(group=[{0}])       
        +            LogicalProject(EMPNO=[$0])
        +              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
         ]]>
        

        Thanks Julian. I am wondering why is INNER JOIN in above plan required, where inner query is joined with distinct empno with condition empno=deptno. This is effectively making one of the left outer join condition empno=deptno always true. Not sure if this is semantically correct.

        Show
        vgarg Vineet Garg added a comment - + LogicalFilter(condition=[IS NULL($11)]) + LogicalJoin(condition=[AND(=($0, $12), =($2, $10))], joinType=[left]) + LogicalTableScan(table=[[CATALOG, SALES, EMP]]) + LogicalJoin(condition=[=($3, $0)], joinType=[ inner ]) <======================================Why is this required ? + LogicalProject(DEPTNO=[$0], JOB=[$1], $f2=[ true ]) + LogicalAggregate(group=[{0, 1}]) + LogicalProject(DEPTNO=[$0], JOB=[$2], i=[$1]) + LogicalProject(DEPTNO=[$0], i=[ true ], JOB=[$1]) + LogicalProject(DEPTNO=[$0], JOB=[$2]) + LogicalJoin(condition=[=($2, $1)], joinType=[ inner ]) + LogicalTableScan(table=[[CATALOG, SALES, DEPT]]) + LogicalAggregate(group=[{0}]) + LogicalProject(JOB=[$2]) + LogicalTableScan(table=[[CATALOG, SALES, EMP]]) + LogicalAggregate(group=[{0}]) + LogicalProject(EMPNO=[$0]) + LogicalTableScan(table=[[CATALOG, SALES, EMP]]) ]]> Thanks Julian. I am wondering why is INNER JOIN in above plan required, where inner query is joined with distinct empno with condition empno=deptno . This is effectively making one of the left outer join condition empno=deptno always true. Not sure if this is semantically correct.
        Hide
        julianhyde Julian Hyde added a comment -

        To be honest, I'm not sure whether that join is required, and it's very hard to answer.

        For this bug, can we please stick to the issue of whether the plan is correct, not whether it is efficient. Can you figure out whether it is correct (i.e. produces the correct results for all possible data). If it's not correct give a data set that breaks it. If it is correct I will commit the fix.

        Show
        julianhyde Julian Hyde added a comment - To be honest, I'm not sure whether that join is required, and it's very hard to answer. For this bug, can we please stick to the issue of whether the plan is correct, not whether it is efficient. Can you figure out whether it is correct (i.e. produces the correct results for all possible data). If it's not correct give a data set that breaks it. If it is correct I will commit the fix.
        Hide
        vgarg Vineet Garg added a comment -

        Sure. Give me sometime to think of a case where this might produce wrong result. So far semantically this plan looks correct.

        Show
        vgarg Vineet Garg added a comment - Sure. Give me sometime to think of a case where this might produce wrong result. So far semantically this plan looks correct.
        Hide
        vgarg Vineet Garg added a comment -

        Julian Hyde This looks fine. I couldn't think of case where such plan will produce wrong results.

        Show
        vgarg Vineet Garg added a comment - Julian Hyde This looks fine. I couldn't think of case where such plan will produce wrong results.
        Hide
        vgarg Vineet Garg added a comment - - edited

        Is it possible to add a test case involving null keys ? I am assuming plan will have count(*), count() to consider case with NULL keys. Above plan looks fine for non-null keys.

        Show
        vgarg Vineet Garg added a comment - - edited Is it possible to add a test case involving null keys ? I am assuming plan will have count(*), count() to consider case with NULL keys. Above plan looks fine for non-null keys.
        Hide
        julianhyde Julian Hyde added a comment -

        Can you point me to some test cases (maybe in Hive's suite) for this kind of query? With the data set (maybe CREATE TABLE and INSERT statements), queries and expected results? I'm thinking of allowing CREATE TABLE in Quidem tests (.iq files) and so we should be able to run the tests directly in Calcite.

        Show
        julianhyde Julian Hyde added a comment - Can you point me to some test cases (maybe in Hive's suite) for this kind of query? With the data set (maybe CREATE TABLE and INSERT statements), queries and expected results? I'm thinking of allowing CREATE TABLE in Quidem tests (.iq files) and so we should be able to run the tests directly in Calcite.
        Hide
        vgarg Vineet Garg added a comment -

        Following is one example of such test case:

        create table table1(i int, j int);
        create table table2(i int, j int);
        
        insert into table1 values(1, 2);
        insert into table1 values(1, 3);
        
        insert into table2 values(NULL, 1);
        insert into table2 values(2, 1);
        
        select i, j from table1 where table1.j NOT IN (select i from table2 where table1.i=table2.j);
        
        expected result: no rows
        

        More could be found at https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/subquery_notin.q
        Schema used by queries in subquery_notin.q is setup in https://github.com/apache/hive/blob/master/data/scripts/q_test_init.sql

        Show
        vgarg Vineet Garg added a comment - Following is one example of such test case: create table table1(i int , j int ); create table table2(i int , j int ); insert into table1 values(1, 2); insert into table1 values(1, 3); insert into table2 values(NULL, 1); insert into table2 values(2, 1); select i, j from table1 where table1.j NOT IN (select i from table2 where table1.i=table2.j); expected result: no rows More could be found at https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/subquery_notin.q Schema used by queries in subquery_notin.q is setup in https://github.com/apache/hive/blob/master/data/scripts/q_test_init.sql
        Hide
        vgarg Vineet Garg added a comment -

        Julian Hyde I in-coporated your changes into Hive and tested various queries (https://github.com/vineetgarg02/hive/blob/HIVE-15192/ql/src/test/queries/clientpositive/subquery_notin.q, https://github.com/vineetgarg02/hive/blob/HIVE-15192/ql/src/test/queries/clientpositive/subquery_notin_having.q) which also include null keys. I did not see any wrong result.
        Are you still planning to allow CREATE TABLE in tests? If not this patch looks good to me and you can go ahead and commit it.

        Show
        vgarg Vineet Garg added a comment - Julian Hyde I in-coporated your changes into Hive and tested various queries ( https://github.com/vineetgarg02/hive/blob/HIVE-15192/ql/src/test/queries/clientpositive/subquery_notin.q , https://github.com/vineetgarg02/hive/blob/HIVE-15192/ql/src/test/queries/clientpositive/subquery_notin_having.q ) which also include null keys. I did not see any wrong result. Are you still planning to allow CREATE TABLE in tests? If not this patch looks good to me and you can go ahead and commit it.
        Hide
        vgarg Vineet Garg added a comment -

        Ignore my last comment. I found a bug with this patch. This patch is not generating plan with count(), count(*) to consider null values.

        ===Reproducer====

        create table t1(a int, b int);
        insert into t1 values(1,0);
        insert into t1 values(1,0);
        insert into t1 values(1,0);
        
        create table t2(a int, b int);
        insert into t2 values(2,1);
        insert into t2 values(3,1);
        insert into t2 values(NULL,1);
        
        select t1.a from t1 where t1.b NOT IN (select t2.a from t2 where t2.b=t1.a);
        
        --Expected result: 
        zero rows
        --Actual result
        1
        1
        1
        Fetched: 3 row(s)
        

        This is what the plan looks like on HIVE side just after decorrelation

        HiveProject(a=[$0])
          LogicalProject(a=[$0], b=[$1], BLOCK__OFFSET__INSIDE__FILE=[$2], INPUT__FILE__NAME=[$3], ROW__ID=[$4])
            LogicalFilter(condition=[NOT(CASE(IS NOT NULL($7), true, IS NULL($1), null, false))])
              LogicalJoin(condition=[AND(=($0, $6), =($1, $8))], joinType=[left])
                HiveTableScan(table=[[default.t_1]], table:alias=[t_1])
                LogicalJoin(condition=[=($3, $0)], joinType=[inner])
                  LogicalProject(a=[$0], a0=[$1], $f2=[true])
                    LogicalAggregate(group=[{0, 1}])
                      LogicalProject(a=[$0], a0=[$2], i=[$1])
                        LogicalProject(a=[$0], i=[true], a0=[$1])
                          HiveProject(a=[$0], a0=[$5])
                            LogicalJoin(condition=[=($1, $5)], joinType=[inner])
                              HiveTableScan(table=[[default.t_2]], table:alias=[t_2])
                              LogicalAggregate(group=[{0}])
                                LogicalProject(a=[$0])
                                  HiveTableScan(table=[[default.t_1]], table:alias=[t_1])
                  LogicalAggregate(group=[{0}])
                    LogicalProject(b=[$1])
                      HiveTableScan(table=[[default.t_1]], table:alias=[t_1])
        

        As you can notice this plan is not doing null check using count, count*

        Show
        vgarg Vineet Garg added a comment - Ignore my last comment. I found a bug with this patch. This patch is not generating plan with count(), count(*) to consider null values. ===Reproducer==== create table t1(a int , b int ); insert into t1 values(1,0); insert into t1 values(1,0); insert into t1 values(1,0); create table t2(a int , b int ); insert into t2 values(2,1); insert into t2 values(3,1); insert into t2 values(NULL,1); select t1.a from t1 where t1.b NOT IN (select t2.a from t2 where t2.b=t1.a); --Expected result: zero rows --Actual result 1 1 1 Fetched: 3 row(s) This is what the plan looks like on HIVE side just after decorrelation HiveProject(a=[$0]) LogicalProject(a=[$0], b=[$1], BLOCK__OFFSET__INSIDE__FILE=[$2], INPUT__FILE__NAME=[$3], ROW__ID=[$4]) LogicalFilter(condition=[NOT(CASE(IS NOT NULL($7), true, IS NULL($1), null, false))]) LogicalJoin(condition=[AND(=($0, $6), =($1, $8))], joinType=[left]) HiveTableScan(table=[[default.t_1]], table:alias=[t_1]) LogicalJoin(condition=[=($3, $0)], joinType=[inner]) LogicalProject(a=[$0], a0=[$1], $f2=[true]) LogicalAggregate(group=[{0, 1}]) LogicalProject(a=[$0], a0=[$2], i=[$1]) LogicalProject(a=[$0], i=[true], a0=[$1]) HiveProject(a=[$0], a0=[$5]) LogicalJoin(condition=[=($1, $5)], joinType=[inner]) HiveTableScan(table=[[default.t_2]], table:alias=[t_2]) LogicalAggregate(group=[{0}]) LogicalProject(a=[$0]) HiveTableScan(table=[[default.t_1]], table:alias=[t_1]) LogicalAggregate(group=[{0}]) LogicalProject(b=[$1]) HiveTableScan(table=[[default.t_1]], table:alias=[t_1]) As you can notice this plan is not doing null check using count, count*
        Hide
        julianhyde Julian Hyde added a comment -

        Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/1ccebc89. Thanks for you help in developing a fix, Vineet Garg!

        Show
        julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/1ccebc89 . Thanks for you help in developing a fix, Vineet Garg !
        Hide
        julianhyde Julian Hyde added a comment -

        Resolved in release 1.11.0 (2017-01-11).

        Show
        julianhyde Julian Hyde added a comment - Resolved in release 1.11.0 (2017-01-11).

          People

          • Assignee:
            julianhyde Julian Hyde
            Reporter:
            vgarg Vineet Garg
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development