Description
I propose the following expression rewrite optimizations:
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x) -> isnull(x)
This might seem contrived, but I saw negated versions of these expressions appear in a user-written query after that query had undergone optimization. For example:
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain(true)

== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
      +- BatchScan[_1#4, _2#5] ParquetScan Location: InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
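This rewrite is also useful for query canonicalization: once it is applied, equivalent null checks such as the isnotnull(_2#5) and NOT isnull(_2#5) conjuncts in the optimized plan above take the same form.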
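To make the proposal concrete, here is a minimal sketch of the two rewrites on a toy expression tree. It is not Spark's optimizer code: Expr, Col, IsNull, IsNotNull, Not, And, Or and NullCheckRewrite are hypothetical stand-ins defined only for this example, and a real rule would presumably pattern-match on the corresponding Catalyst expressions instead. The rewrite is sound because isnull and isnotnull never themselves evaluate to null, so negating one always yields the other.

```scala
// Toy expression tree. These case classes are hypothetical stand-ins
// for illustration only; they are not Spark Catalyst classes.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class IsNull(child: Expr) extends Expr
final case class IsNotNull(child: Expr) extends Expr
final case class Not(child: Expr) extends Expr
final case class And(left: Expr, right: Expr) extends Expr
final case class Or(left: Expr, right: Expr) extends Expr

object NullCheckRewrite {
  // Rewrites bottom-up: children are simplified first, so a negation
  // exposed by simplifying a child (e.g. NOT NOT isnull(x)) is handled too.
  def simplify(e: Expr): Expr = e match {
    case Not(child) =>
      simplify(child) match {
        case IsNull(x)    => IsNotNull(x) // NOT isnull(x)    -> isnotnull(x)
        case IsNotNull(x) => IsNull(x)    // NOT isnotnull(x) -> isnull(x)
        case other        => Not(other)   // any other negation is left alone
      }
    case IsNull(x)    => IsNull(simplify(x))
    case IsNotNull(x) => IsNotNull(simplify(x))
    case And(l, r)    => And(simplify(l), simplify(r))
    case Or(l, r)     => Or(simplify(l), simplify(r))
    case c: Col       => c
  }
}
```

Applied to a filter shaped like the one in the optimized plan, And(IsNotNull(Col("_2")), Not(IsNull(Col("_2")))), simplify returns a conjunction whose two sides are the same IsNotNull(Col("_2")) expression, which is the canonical form the rewrite is meant to produce.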