Details
Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Description
We can add rules that optimize the plan after the cheapest plan has been chosen based on cost. For example, one useful rule would be "remove useless SELECT nodes".
Impala generates a useless SELECT node for the following query:
SELECT t.id, t.int_col
FROM functional.alltypestiny t
  LEFT JOIN (SELECT id, int_col FROM functional.alltypestiny) t2 ON (t.id = t2.id)
WHERE t.int_col = t.id
UNION ALL
VALUES (NULL, NULL)
Its single-node plan is:
PLAN-ROOT SINK
|
00:UNION
|  constant-operands=1
|  row-size=8B cardinality=1
|
04:SELECT
|  predicates: t.id = t.int_col
|  row-size=12B cardinality=0
|
03:HASH JOIN [RIGHT OUTER JOIN]
|  hash predicates: id = t.id
|  runtime filters: RF000 <- t.id
|  row-size=12B cardinality=1
|
|--01:SCAN HDFS [functional.alltypestiny t]
|     HDFS partitions=4/4 files=4 size=460B
|     predicates: t.int_col = t.id
|     row-size=8B cardinality=1
|
02:SCAN HDFS [functional.alltypestiny]
   HDFS partitions=4/4 files=4 size=460B
   runtime filters: RF000 -> id
   row-size=4B cardinality=8
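The SELECT node (id=04) is useless because its only predicate, "t.id = t.int_col", is already enforced in the SCAN node (id=01), which is the right-hand side of the RIGHT OUTER JOIN. A RIGHT OUTER JOIN passes every right-side row through with its column values unchanged, so every row the join produces already satisfies the predicate, and the SELECT node cannot filter out any more rows.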
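A minimal sketch of what such a rule could look like, written in Python on a toy plan tree rather than against the real planner. PlanNode, eq, enforced and remove_useless_selects are hypothetical names made up for this illustration, not Impala classes or APIs. It also assumes predicates can be compared by column name, which ignores the slot and tuple bookkeeping a real implementation would need.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

# An equality predicate is modelled as the unordered set of its two operands,
# so "t.id = t.int_col" and "t.int_col = t.id" compare equal.
Pred = FrozenSet[str]


def eq(a: str, b: str) -> Pred:
    return frozenset((a, b))


@dataclass
class PlanNode:  # hypothetical toy node, not Impala's PlanNode
    kind: str  # "SCAN", "SELECT", "HASH JOIN", "UNION"
    predicates: FrozenSet[Pred] = frozenset()
    children: List["PlanNode"] = field(default_factory=list)
    join_op: str = ""  # only meaningful for joins


def enforced(node: PlanNode) -> FrozenSet[Pred]:
    """Predicates guaranteed to hold on every row the node outputs."""
    if node.kind == "SCAN":
        return node.predicates
    if node.kind == "SELECT":
        return node.predicates | enforced(node.children[0])
    if node.kind == "HASH JOIN":
        left, right = node.children
        if node.join_op == "INNER JOIN":
            return enforced(left) | enforced(right)
        # An outer join only preserves guarantees from its row-preserving
        # side; the other side may be NULL-extended.
        if node.join_op == "LEFT OUTER JOIN":
            return enforced(left)
        if node.join_op == "RIGHT OUTER JOIN":
            return enforced(right)
    # Conservative default (UNION, FULL OUTER JOIN, ...): nothing is known.
    return frozenset()


def remove_useless_selects(node: PlanNode) -> PlanNode:
    """Drop SELECT nodes whose predicates are all enforced below them."""
    node.children = [remove_useless_selects(c) for c in node.children]
    if node.kind == "SELECT" and node.predicates <= enforced(node.children[0]):
        return node.children[0]
    return node


# The plan from this issue: 00:UNION -> 04:SELECT -> 03:HASH JOIN -> (02, 01)
scan_t2 = PlanNode("SCAN")                                         # 02
scan_t = PlanNode("SCAN", frozenset({eq("t.int_col", "t.id")}))    # 01
join = PlanNode("HASH JOIN", children=[scan_t2, scan_t],
                join_op="RIGHT OUTER JOIN")                        # 03
select = PlanNode("SELECT", frozenset({eq("t.id", "t.int_col")}), [join])
plan = remove_useless_selects(PlanNode("UNION", children=[select]))
```

In this sketch the UNION ends up directly on top of the HASH JOIN. If the same filtered scan sat on the NULL-extended side of the join instead, enforced would report nothing for it and the SELECT would be kept, which is the conservative behaviour one would want from such a rule.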