Details
-
Improvement
-
Status: In Progress
-
Major
-
Resolution: Unresolved
-
3.1.0
-
None
-
None
Description
We should infer IsNotNull for all children of NullIntolerant expressions. For example:
CREATE TABLE t1(c1 string, c2 string); CREATE TABLE t2(c1 string, c2 string); EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
== Physical Plan == *(4) Project [c1#5, c2#6] +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner :- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] : +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, c2#6], Statistics(sizeInBytes=8.0 EiB) +- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] +- *(2) Filter isnotnull(c1#7) +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], Statistics(sizeInBytes=8.0 EiB)
We should infer coalesce(t1.c1, t1.c2) IS NOT NULL to improve query performance:
== Physical Plan == *(5) Project [c1#23, c2#24] +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner :- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, [id=#95] : +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) : +- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, c2#24], Statistics(sizeInBytes=8.0 EiB) +- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] +- *(3) Filter isnotnull(c1#25) +- Scan hive default.t2 [c1#25], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, c2#26], Statistics(sizeInBytes=8.0 EiB)
Real performance test case: