Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-27278

Simplify correlated queries with empty inputs

    XMLWordPrintableJSON

Details

    Description

      The correlated query below will not produce any result no matter the content of the table.

      create table t1 (id int, val varchar(10)) stored as orc TBLPROPERTIES ('transactional'='true');
      create table t2 (id int, val varchar(10)) stored as orc TBLPROPERTIES ('transactional'='true');
      
      EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = t2.id);
      

      The CBO is able to derive that part of the query is empty and ends up with the following plan.

      CBO PLAN:
      HiveProject(id=[$0])
        LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}])
          HiveTableScan(table=[[default, t1]], table:alias=[t1])
          HiveValues(tuples=[[]])
      

      The presence of LogicalCorrelate is first redundant but also problematic since many parts of the optimizer assume that queries are decorrelated and do not know how to handle the LogicalCorrelate.

      In the presence of views the same query can lead to the following exception during compilation.

      CREATE MATERIALIZED VIEW v1 AS SELECT id FROM t2;
      EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = t2.id);
      
      org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not enough rules to produce a node with desired properties: convention=HIVE, sort=[], dist=any. All the inputs have relevant nodes, however the cost is still infinite.
      Root: rel#185:RelSubset#3.HIVE.[].any
      Original rel:
      HiveProject(id=[$0]): rowcount = 4.0, cumulative cost = {20.0 rows, 13.0 cpu, 0.0 io}, id = 178
        LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}]): rowcount = 4.0, cumulative cost = {16.0 rows, 9.0 cpu, 0.0 io}, id = 176
          HiveTableScan(table=[[default, t1]], table:alias=[t1]): rowcount = 4.0, cumulative cost = {4.0 rows, 5.0 cpu, 0.0 io}, id = 111
          HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 139
      
      Sets:
      Set#0, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)
      	rel#180:RelSubset#0.HIVE.[].any, best=rel#111
      		rel#111:HiveTableScan.HIVE.[].any(table=[default, t1],htColumns=[0, 1, 2, 3, 4, 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null), rowcount=4.0, cumulative cost={4.0 rows, 5.0 cpu, 0.0 io}
      Set#1, type: RecordType(NULL _o__c0)
      	rel#181:RelSubset#1.HIVE.[].any, best=rel#139
      		rel#139:HiveValues.HIVE.[].any(type=RecordType(NULL _o__c0),tuples=[]), rowcount=1.0, cumulative cost={1.0 rows, 1.0 cpu, 0.0 io}
      Set#2, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)
      	rel#183:RelSubset#2.NONE.[].any, best=null
      		rel#182:LogicalCorrelate.NONE.[].any(left=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}), rowcount=4.0, cumulative cost={inf}
      Set#3, type: RecordType(INTEGER id)
      	rel#185:RelSubset#3.HIVE.[].any, best=null
      		rel#184:HiveProject.HIVE.[].any(input=RelSubset#183,inputs=0,synthetic=false), rowcount=4.0, cumulative cost={inf}
      
      Graphviz:
      digraph G {
      	root [style=filled,label="Root"];
      	subgraph cluster0{
      		label="Set 0 RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)";
      		rel111 [label="rel#111:HiveTableScan\ntable=[default, t1],htColumns=[0, 1, 2, 3, 4, 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null\nrows=4.0, cost={4.0 rows, 5.0 cpu, 0.0 io}",color=blue,shape=box]
      		subset180 [label="rel#180:RelSubset#0.HIVE.[].any"]
      	}
      	subgraph cluster1{
      		label="Set 1 RecordType(NULL _o__c0)";
      		rel139 [label="rel#139:HiveValues\ntype=RecordType(NULL _o__c0),tuples=[]\nrows=1.0, cost={1.0 rows, 1.0 cpu, 0.0 io}",color=blue,shape=box]
      		subset181 [label="rel#181:RelSubset#1.HIVE.[].any"]
      	}
      	subgraph cluster2{
      		label="Set 2 RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)";
      		rel182 [label="rel#182:LogicalCorrelate\nleft=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}\nrows=4.0, cost={inf}",shape=box]
      		subset183 [label="rel#183:RelSubset#2.NONE.[].any"]
      	}
      	subgraph cluster3{
      		label="Set 3 RecordType(INTEGER id)";
      		rel184 [label="rel#184:HiveProject\ninput=RelSubset#183,inputs=0,synthetic=false\nrows=4.0, cost={inf}",shape=box]
      		subset185 [label="rel#185:RelSubset#3.HIVE.[].any"]
      	}
      	root -> subset185;
      	subset180 -> rel111[color=blue];
      	subset181 -> rel139[color=blue];
      	subset183 -> rel182; rel182 -> subset180[label="0"]; rel182 -> subset181[label="1"];
      	subset185 -> rel184; rel184 -> subset183;
      }
      	at org.apache.calcite.plan.volcano.RelSubset$CheapestPlanReplacer.visit(RelSubset.java:742) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.plan.volcano.RelSubset.buildCheapestPlan(RelSubset.java:365) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:520) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyMaterializedViewRewriting(CalcitePlanner.java:2058) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1722) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1591) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1343) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:570) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12820) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:465) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:180) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:107) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:519) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:471) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:436) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:430) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
      	at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
      	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
      	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
      	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
      	at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:733) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:703) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:115) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62) ~[test-classes/:?]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_261]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_261]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_261]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
      	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.2.jar:4.13.2]
      	at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:413) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.Suite.runChild(Suite.java:128) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.Suite.runChild(Suite.java:27) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) ~[junit-4.13.2.jar:4.13.2]
      	at org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
      	at org.junit.rules.RunRules.evaluate(RunRules.java:20) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
      	at org.junit.runners.ParentRunner.run(ParentRunner.java:413) ~[junit-4.13.2.jar:4.13.2]
      	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:377) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:138) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:465) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
      	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:451) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
      

      The goal of this ticket is to get rid of the redundant correlation to avoid compilation failures but also for unlocking further simplifications and improving plan readability.

      The plan can be simplified further based on the following observations.

      If the right side of the correlate is empty then the whole correlate is empty when joinType is SEMI/INNER. Moreover if correlate type is LEFT then we can also drop the correlate and use t1 padded with nulls for the right side. Lastly, if the type is ANTI then result is the entire t1 so the correlate can also be dropped. RIGHT and FULL correlations are invalid and should never appear in the plan.

      If the left side of the correlate is empty the result is empty and the correlation can be dropped for every legal joinType (INNER/SEMI/ANTI/LEFT).

      Attachments

        Issue Links

          Activity

            People

              zabetak Stamatis Zampetakis
              zabetak Stamatis Zampetakis
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 20m
                  2h 20m