Details
Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
3.1.0
None
None
Description
I ran across a family of expressions like
if(x is null, x, substring(x, 0, 1024))
or
when($"x".isNull, $"x", substring($"x", 0, 1024))
that were written this way because the query author was unsure about whether substring would return null when its input string argument is null.
This explicit null-handling is unnecessary and adds bloat to the generated code, especially if it's done via a CASE statement (which compiles down to a do-while loop).
In another case I saw a query compiler which automatically generated this type of code.
It would be cool if Spark could automatically optimize such queries to remove these redundant null checks. Here's a sketch of what such a rule might look like (assuming that SPARK-28477 has been implemented, so we only need to worry about the IF case):
- In the pattern match, check the following three conditions in this order (to benefit from short-circuiting):
  - The IF condition is an explicit null check of a column c.
  - The true expression returns either c or null.
  - The false expression is a null-intolerant expression with c as a direct child.
- If all three conditions hold, replace the entire If with the false branch's expression.
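To make the three checks concrete, here is a minimal, self-contained sketch of the rule on a toy expression tree. It is not Catalyst code: the classes `Expr`, `Column`, `IsNull`, `Substring`, `Coalesce`, `If`, the `null_intolerant` flag, and the function `remove_redundant_null_check` are all hypothetical stand-ins chosen for illustration. The sketch rewrites only the top-level expression and ignores data types.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Base class of a toy expression tree (hypothetical, not Spark's Catalyst)."""
    # Assumption: True means "returns null whenever any direct child is null".
    null_intolerant = False

    def children(self) -> Tuple["Expr", ...]:
        return ()


@dataclass(frozen=True)
class Column(Expr):
    name: str


@dataclass(frozen=True)
class NullLiteral(Expr):
    pass


@dataclass(frozen=True)
class IntLiteral(Expr):
    value: int


@dataclass(frozen=True)
class IsNull(Expr):
    child: Expr

    def children(self):
        return (self.child,)


@dataclass(frozen=True)
class Substring(Expr):
    null_intolerant = True  # null input string => null result
    string: Expr
    pos: Expr
    length: Expr

    def children(self):
        return (self.string, self.pos, self.length)


@dataclass(frozen=True)
class Coalesce(Expr):
    # Not null-intolerant: it can return a non-null value for a null child.
    args: Tuple[Expr, ...]

    def children(self):
        return self.args


@dataclass(frozen=True)
class If(Expr):
    predicate: Expr
    true_value: Expr
    false_value: Expr

    def children(self):
        return (self.predicate, self.true_value, self.false_value)


def remove_redundant_null_check(expr: Expr) -> Expr:
    """Return the false branch if the If is a redundant null check, else expr."""
    if not isinstance(expr, If):
        return expr
    pred, t, f = expr.predicate, expr.true_value, expr.false_value
    # 1. The IF condition is an explicit null check of a column c.
    if not (isinstance(pred, IsNull) and isinstance(pred.child, Column)):
        return expr
    c = pred.child
    # 2. The true expression returns either c or null.
    if not (t == c or isinstance(t, NullLiteral)):
        return expr
    # 3. The false expression is null-intolerant with c as a direct child.
    if f.null_intolerant and c in f.children():
        return f
    return expr


x = Column("x")
sub = Substring(x, IntLiteral(0), IntLiteral(1024))
simplified = remove_redundant_null_check(If(IsNull(x), x, sub))
```

The checks run cheapest-first, so an If that fails the null-check test is rejected before the branches are inspected. A false branch such as `Coalesce((x, IntLiteral(1)))` is left alone, because it is not null-intolerant and the explicit check changes the result.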