Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1494

Inefficient plan for correlated sub-queries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: core
    • Labels:

      Description

      For co-related queries such as

       select sal from emp where empno IN (select deptno from dept where emp.job = dept.name) 

      Calcite generates following plan (SubqueryRemove Rule + Decorrelation)

      LogicalProject(SAL=[$5])
        LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], SLACKER=[$8])
          LogicalJoin(condition=[AND(=($2, $10), =($0, $9))], joinType=[inner])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
            LogicalAggregate(group=[{0, 1}])
              LogicalProject(DEPTNO=[$0], JOB=[$1])
                LogicalProject(DEPTNO=[$0], JOB=[$2])
                  LogicalJoin(condition=[=($2, $1)], joinType=[inner])
                    LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
                    LogicalAggregate(group=[{0}])
                      LogicalProject(JOB=[$2])
                        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      

      As you can notice there is a outer table scan (EMP in this case) to retrieve all distinct values for co-related column (EMP.JOB here), which is then joined with inner table (DEPT).
      I am not sure why is this step required. After this join Calcite is anyway doing group by to generate all distinct values for correlated and result column (DEPTNO, JOB) which is then joined with outer table.
      I think the scan + join of outer table with inner table to generate co-rrelated values is un-necessary and is not required.

        Issue Links

          Activity

          Hide
          vgarg Vineet Garg added a comment -

          Note that Calcite seems to generate similar plan for EXISTS/NOT EXISTS correlated queries as well

          Show
          vgarg Vineet Garg added a comment - Note that Calcite seems to generate similar plan for EXISTS/NOT EXISTS correlated queries as well
          Hide
          julianhyde Julian Hyde added a comment -

          Calcite's general algorithm to de-correlate creates a "value generator" query to generate all possible values of a correlating variable. In some cases, such as this, the value generator is not necessary because the variable appears in an equi-join. But it's not currently smart enough to skip the value generator query.

          Show
          julianhyde Julian Hyde added a comment - Calcite's general algorithm to de-correlate creates a "value generator" query to generate all possible values of a correlating variable. In some cases, such as this, the value generator is not necessary because the variable appears in an equi-join. But it's not currently smart enough to skip the value generator query.
          Hide
          julianhyde Julian Hyde added a comment -

          I've made progress (see dev branch) and several plans are showing improvement but there are 7 test failures, so this is not going to make the cut for release 1.11.

          Show
          julianhyde Julian Hyde added a comment - I've made progress (see dev branch) and several plans are showing improvement but there are 7 test failures, so this is not going to make the cut for release 1.11.
          Hide
          vgarg Vineet Garg added a comment - - edited

          Hi Julian Hyde,

          I have a commit at https://github.com/vineetgarg02/calcite/commit/f3bb8941ac258b49ce88dbe48110ee0bd1c8fb91 which fixes following test failures

          • SqlToRelConverterTest.testNestedCorrelationsDecorrelated
          • SqlToRelConverterTest.testNestedCorrelationsDecorrelated2
          • RelOptRulesTest.testWhereNotInCorrelated
          • RelOptRulesTest.testWhereNotInCorrelated2

          I am not sure about blank.iq as I wasn't able to run it.
          Also I see org.apache.calcite.test.MaterializationTest.testSubQuery is still failing but from what I have looked this is expected and needs a test output update.

          Can you take a look at this commit and incorporate into your branch and re-run calcite tests? Let me know if you would like me to upload the patch here.

          Note that my fix is on top of your 1494-val-gen changes.

          Show
          vgarg Vineet Garg added a comment - - edited Hi Julian Hyde , I have a commit at https://github.com/vineetgarg02/calcite/commit/f3bb8941ac258b49ce88dbe48110ee0bd1c8fb91 which fixes following test failures SqlToRelConverterTest.testNestedCorrelationsDecorrelated SqlToRelConverterTest.testNestedCorrelationsDecorrelated2 RelOptRulesTest.testWhereNotInCorrelated RelOptRulesTest.testWhereNotInCorrelated2 I am not sure about blank.iq as I wasn't able to run it. Also I see org.apache.calcite.test.MaterializationTest.testSubQuery is still failing but from what I have looked this is expected and needs a test output update. Can you take a look at this commit and incorporate into your branch and re-run calcite tests? Let me know if you would like me to upload the patch here. Note that my fix is on top of your 1494-val-gen changes.
          Hide
          julianhyde Julian Hyde added a comment -

          Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/73e437fe. Many thanks to Vineet Garg for his contributions to the fix!

          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/73e437fe . Many thanks to Vineet Garg for his contributions to the fix!
          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.12.0 (2017-03-24).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.12.0 (2017-03-24).
          Hide
          vgarg Vineet Garg added a comment -

          Hi Julian Hyde
          Do you remember why were this optimization done only for equality predicates? e.g. findCorrelationEquivalent here LINK to diff skips if Rex Node is not of EQUAL type. We end up with value generator (join with outer query) for queries which have non-equality correlated predicates e.g. select sal from emp where empno IN (select deptno from dept where emp.job <> dept.name. Should it be safe to optimize this and not generate value generator for non-equality predicates as well?

          Show
          vgarg Vineet Garg added a comment - Hi Julian Hyde Do you remember why were this optimization done only for equality predicates? e.g. findCorrelationEquivalent here LINK to diff skips if Rex Node is not of EQUAL type. We end up with value generator (join with outer query) for queries which have non-equality correlated predicates e.g. select sal from emp where empno IN (select deptno from dept where emp.job <> dept.name . Should it be safe to optimize this and not generate value generator for non-equality predicates as well?
          Hide
          julianhyde Julian Hyde added a comment -

          When de-correlating a query, you need to generate an un-correlated sub-query that returns all possible values of the correlating variable as one of its columns (or a super-set). If the condition is "t1.c1 = 10" then the list of values is obviously

          {10}

          , and if the condition is "t1.c1 = t2.c2" then the list of values is the same as the values in t2.c2. But if the condition is "t1.c1 > t2.c2" then the list of possible values is infinite.

          Equality isn't the only condition that would work. We could in theory generate lists of values for "t1.c1 between t2.c2 and t2.c2 + 10" if c1 and c2 have a discrete domain such as integers. Then we'd only generate 10x too many values.

          We could also do the rewrite for "t1.c1 is null" and for low-cardinality domains such as tinyint, char(1), and boolean.

          But equality is the only case that comes up regularly.

          Show
          julianhyde Julian Hyde added a comment - When de-correlating a query, you need to generate an un-correlated sub-query that returns all possible values of the correlating variable as one of its columns (or a super-set). If the condition is "t1.c1 = 10" then the list of values is obviously {10} , and if the condition is "t1.c1 = t2.c2" then the list of values is the same as the values in t2.c2. But if the condition is "t1.c1 > t2.c2" then the list of possible values is infinite. Equality isn't the only condition that would work. We could in theory generate lists of values for "t1.c1 between t2.c2 and t2.c2 + 10" if c1 and c2 have a discrete domain such as integers. Then we'd only generate 10x too many values. We could also do the rewrite for "t1.c1 is null" and for low-cardinality domains such as tinyint, char(1), and boolean. But equality is the only case that comes up regularly.
          Hide
          vgarg Vineet Garg added a comment -

          Thanks for your response Julian Hyde.

          I am trying to understand why do we even need value generator? Currently for queries like select sal from emp where empno IN (select deptno from dept where emp.job = dept.name we skip value generator and instead do group by on subquery expression and correlated columns to generate possible values. Wouldn't doing so for all of the examples you provided above be semantically correct?

          Could you please provide an example query where skipping value generator will not be semantically correct thing to do?

          Show
          vgarg Vineet Garg added a comment - Thanks for your response Julian Hyde . I am trying to understand why do we even need value generator? Currently for queries like select sal from emp where empno IN (select deptno from dept where emp.job = dept.name we skip value generator and instead do group by on subquery expression and correlated columns to generate possible values. Wouldn't doing so for all of the examples you provided above be semantically correct? Could you please provide an example query where skipping value generator will not be semantically correct thing to do?
          Hide
          julianhyde Julian Hyde added a comment -
          select *
          from emp
          where exists (
            select 1
            from dept
            where deptno > emp.deptno)
          Show
          julianhyde Julian Hyde added a comment - select * from emp where exists ( select 1 from dept where deptno > emp.deptno)
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Per this classic ; https://pdfs.semanticscholar.org/650f/dd065b407e0a4fdd854b1861e58eed289955.pdf question Vineet Garg is raising does seem fair. Why do we need 'value generator' ? If you look at query 3 on page 3 of the paper it has non-equi join predicate on correlated variable. Rewrite on page 4 is doing semijoin between inner and outer query with non-equi join predicate to get all qualifying values and then doing join of that with outer query. It does seem like generating 'all possible values of the correlating variable as one of its columns (or a super-set)' is wasteful.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Per this classic ; https://pdfs.semanticscholar.org/650f/dd065b407e0a4fdd854b1861e58eed289955.pdf question Vineet Garg is raising does seem fair. Why do we need 'value generator' ? If you look at query 3 on page 3 of the paper it has non-equi join predicate on correlated variable. Rewrite on page 4 is doing semijoin between inner and outer query with non-equi join predicate to get all qualifying values and then doing join of that with outer query. It does seem like generating 'all possible values of the correlating variable as one of its columns (or a super-set)' is wasteful.
          Hide
          julianhyde Julian Hyde added a comment -

          Ashutosh Chauhan, You're making an "EXISTS" argument, and I'm making a "FOR ALL" argument.

          Yes, I totally agree with you: there are many many queries where a value generator is wasteful. (Query 3 on page 3 has lots of joins, so it make sense to use them.)

          But do you agree with me: there are some queries that cannot be solved without a value generator?

          By the way, do you have access to Seshadri, Pirahesh, Leung: Complex Query Decorrelation, 1996? I think that may contain some tough examples.

          Show
          julianhyde Julian Hyde added a comment - Ashutosh Chauhan , You're making an "EXISTS" argument, and I'm making a "FOR ALL" argument. Yes, I totally agree with you: there are many many queries where a value generator is wasteful. (Query 3 on page 3 has lots of joins, so it make sense to use them.) But do you agree with me: there are some queries that cannot be solved without a value generator? By the way, do you have access to Seshadri, Pirahesh, Leung: Complex Query Decorrelation, 1996 ? I think that may contain some tough examples.

            People

            • Assignee:
              julianhyde Julian Hyde
              Reporter:
              vgarg Vineet Garg
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development