Status: In Progress
I ran across a family of expressions that wrap a call such as substring in an explicit null check. They were written this way because the query author was unsure whether substring would return null when its input string argument is null.
This explicit null-handling is unnecessary and adds bloat to the generated code, especially if it's done via a CASE statement (which compiles down to a do-while loop).
In another case I saw a query compiler which automatically generated this type of code.
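To make the redundancy concrete, here is a minimal Python sketch. It is not Spark code: `substring` below is a hypothetical stand-in that models a null-intolerant function (None plays the role of NULL), and `guarded` / `unguarded` are illustrative names. Under that assumption, the explicit null check never changes the result.

```python
from typing import Optional


def substring(s: Optional[str], pos: int, length: int) -> Optional[str]:
    """Toy null-intolerant substring: None (NULL) in, None out.

    Assumption: pos is 0-based here, which is a simplification and not a
    statement about Spark's own substring semantics.
    """
    if s is None:
        return None
    return s[pos:pos + length]


def guarded(c: Optional[str]) -> Optional[str]:
    # Shape of the redundant pattern: IF(c IS NULL, c, substring(c, 0, 4))
    return c if c is None else substring(c, 0, 4)


def unguarded(c: Optional[str]) -> Optional[str]:
    # The same expression without the explicit null check.
    return substring(c, 0, 4)


# Both forms agree on null and non-null inputs.
for value in [None, "", "sql", "spark"]:
    assert guarded(value) == unguarded(value)
```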
It would be cool if Spark could automatically optimize such queries to remove these redundant null checks. Here's a sketch of what such a rule might look like (assuming that SPARK-28477 has been implemented, so we only need to worry about the IF case):
- In the pattern match, check the following three conditions in the following order (to benefit from short-circuiting):
  - The IF condition is an explicit null-check of a column c.
  - The true expression returns either c or null.
  - The false expression is a null-intolerant expression with c as a direct child.
- If all three conditions match, replace the entire If with the false branch's expression.
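The rule sketched above can be modeled on a toy expression tree. The following Python is only an illustration of the matching logic, not Catalyst code: every class name (`Column`, `Literal`, `NullLiteral`, `IsNull`, `Call`, `If`), the `null_intolerant` flag, and the function `remove_redundant_null_check` are hypothetical and chosen for this example. It rewrites only the top-level node; a real optimizer rule would presumably apply the same match across the whole expression tree.

```python
from dataclasses import dataclass
from typing import Tuple


class Expr:
    """Base class of the toy expression tree (hypothetical, not Catalyst)."""


@dataclass(frozen=True)
class Column(Expr):
    name: str


@dataclass(frozen=True)
class Literal(Expr):
    value: object


@dataclass(frozen=True)
class NullLiteral(Expr):
    pass


@dataclass(frozen=True)
class IsNull(Expr):
    child: Expr


@dataclass(frozen=True)
class Call(Expr):
    name: str
    args: Tuple[Expr, ...]
    # Assumed marker: the call returns null whenever any argument is null.
    null_intolerant: bool = False


@dataclass(frozen=True)
class If(Expr):
    predicate: Expr
    true_value: Expr
    false_value: Expr


def remove_redundant_null_check(expr: Expr) -> Expr:
    """Return the false branch if expr matches the redundant pattern."""
    if not isinstance(expr, If):
        return expr
    pred = expr.predicate
    # Condition 1: the IF condition is an explicit null check of a column c.
    if not (isinstance(pred, IsNull) and isinstance(pred.child, Column)):
        return expr
    c = pred.child
    # Condition 2: the true expression returns either c or null.
    if not (expr.true_value == c or isinstance(expr.true_value, NullLiteral)):
        return expr
    # Condition 3: the false expression is null-intolerant with c as a direct child.
    f = expr.false_value
    if isinstance(f, Call) and f.null_intolerant and c in f.args:
        return f
    return expr


c = Column("c")
sub = Call("substring", (c, Literal(0), Literal(1024)), null_intolerant=True)

# Both accepted shapes of the true branch collapse to the bare call.
assert remove_redundant_null_check(If(IsNull(c), c, sub)) == sub
assert remove_redundant_null_check(If(IsNull(c), NullLiteral(), sub)) == sub

# A true branch that is neither c nor null is left alone.
kept = If(IsNull(c), Literal("n/a"), sub)
assert remove_redundant_null_check(kept) == kept
```

The early returns mirror the ordering in the list: the cheap structural checks run first, and the scan of the false branch's children happens only once the first two conditions hold.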