Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-31809

Infer IsNotNull for non null intolerant child of null intolerant in join condition

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      We should infer IsNotNull for all children of NullIntolerant expressions. For example:

      CREATE TABLE t1(c1 string, c2 string);
      CREATE TABLE t2(c1 string, c2 string);
      EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
      
      == Physical Plan ==
      *(4) Project [c1#5, c2#6]
      +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
         :- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
         :     +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, c2#6], Statistics(sizeInBytes=8.0 EiB)
         +- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
               +- *(2) Filter isnotnull(c1#7)
                  +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], Statistics(sizeInBytes=8.0 EiB)
      

      We should infer coalesce(t1.c1, t1.c2) IS NOT NULL to improve query performance:

      == Physical Plan ==
      *(5) Project [c1#23, c2#24]
      +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
         :- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, [id=#95]
         :     +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
         :        +- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, c2#24], Statistics(sizeInBytes=8.0 EiB)
         +- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
               +- *(3) Filter isnotnull(c1#25)
                  +- Scan hive default.t2 [c1#25], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, c2#26], Statistics(sizeInBytes=8.0 EiB)
      

      Real performance test case:

      Attachments

        1. default.png
          442 kB
          Yuming Wang
        2. infer.png
          459 kB
          Yuming Wang

        Activity

          People

            yumwang Yuming Wang
            yumwang Yuming Wang
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: