[SPARK-31809] Infer IsNotNull for non null intolerant child of null intolerant in join condition - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels:
None

Description

We should infer IsNotNull for all children of NullIntolerant expressions. For example:

CREATE TABLE t1(c1 string, c2 string);
CREATE TABLE t2(c1 string, c2 string);
EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;

== Physical Plan ==
*(4) Project [c1#5, c2#6]
+- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
   :- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
   :     +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, c2#6], Statistics(sizeInBytes=8.0 EiB)
   +- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
         +- *(2) Filter isnotnull(c1#7)
            +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], Statistics(sizeInBytes=8.0 EiB)

We should infer coalesce(t1.c1, t1.c2) IS NOT NULL to improve query performance:

== Physical Plan ==
*(5) Project [c1#23, c2#24]
+- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
   :- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, [id=#95]
   :     +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
   :        +- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, c2#24], Statistics(sizeInBytes=8.0 EiB)
   +- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
         +- *(3) Filter isnotnull(c1#25)
            +- Scan hive default.t2 [c1#25], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, c2#26], Statistics(sizeInBytes=8.0 EiB)

Real performance test case:

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

default.png
25/May/20 07:32
442 kB
Yuming Wang
infer.png
25/May/20 07:32
459 kB
Yuming Wang

Issue Links

links to

[Github] Pull Request #28642 (wangyum)

Activity

People

Assignee:: Yuming Wang

Reporter:: Yuming Wang

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 25/May/20 07:31

Updated:: 26/Jul/21 01:40