[SPARK-28478] Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x))) - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

I ran across a family of expressions like

if(x is null, x, substring(x, 0, 1024))

when($"x".isNull, $"x", substring($"x", 0, 1024))

that were written this way because the query author was unsure about whether substring would return null when its input string argument is null.

This explicit null-handling is unnecessary and adds bloat to the generated code, especially if it's done via a CASE statement (which compiles down to a do-while loop).

In another case I saw a query compiler which automatically generated this type of code.

It would be cool if Spark could automatically optimize such queries to remove these redundant null checks. Here's a sketch of what such a rule might look like (assuming that ~~SPARK-28477~~ has been implemented, so we only need to worry about the IF case):

In the pattern match, check the following three conditions in this order (to benefit from short-circuiting):
- The IF condition is an explicit null-check of a column c.
- The true expression returns either c or null.
- The false expression is a null-intolerant expression with c as a direct child.

If all three conditions match, replace the entire If with the false branch's expression.
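
The three checks above can be illustrated on a toy expression tree. This is only a minimal sketch: `Expr`, `Col`, `NullLit`, `IntLit`, `Coalesce` and `RemoveRedundantNullCheck` are hypothetical stand-ins written for this example, and the `NullIntolerant` marker here merely mimics the idea of Catalyst's null-intolerant expressions. It is not the actual rule and does not use Spark's classes.

```scala
// Toy expression tree. Every type here is a hypothetical stand-in for
// illustration only; none of these are Catalyst's real classes.
sealed trait Expr { def children: Seq[Expr] }

case class Col(name: String) extends Expr { val children: Seq[Expr] = Nil }
case object NullLit extends Expr { val children: Seq[Expr] = Nil }
case class IntLit(value: Int) extends Expr { val children: Seq[Expr] = Nil }
case class IsNull(child: Expr) extends Expr { val children: Seq[Expr] = Seq(child) }
case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr {
  val children: Seq[Expr] = Seq(cond, whenTrue, whenFalse)
}

// Marker: the expression evaluates to null whenever any direct child is null.
trait NullIntolerant extends Expr

case class Substring(str: Expr, pos: Expr, len: Expr) extends NullIntolerant {
  val children: Seq[Expr] = Seq(str, pos, len)
}

// Not null-intolerant: it can return a non-null value for a null input.
case class Coalesce(args: Seq[Expr]) extends Expr { val children: Seq[Expr] = args }

object RemoveRedundantNullCheck {
  // Rewrites a single node; a real rule would apply this across the whole tree.
  def rewrite(e: Expr): Expr = e match {
    // 1. The condition is an explicit null check of a column c (the pattern).
    // 2. The true branch returns c or null (first guard clause).
    // 3. The false branch is null-intolerant with c as a direct child.
    case If(IsNull(c: Col), t, f)
        if (t == c || t == NullLit) &&
          f.isInstanceOf[NullIntolerant] &&
          f.children.contains(c) =>
      f
    case other => other
  }
}
```

Under these assumptions, `If(IsNull(x), x, Substring(x, IntLit(0), IntLit(1024)))` collapses to the bare `Substring`, while an `If` whose false branch is `Coalesce` is left untouched, since dropping the null check there could change the result. The rewrite is safe in the matching case because a null c already makes the false branch evaluate to null, which is what the true branch would have returned.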

Issue Links

links to

GitHub Pull Request #27231

People

Assignee: Unassigned

Reporter: Josh Rosen

Votes: 0

Watchers: 4

Dates

Created: 23/Jul/19 02:02

Updated: 30/Apr/20 16:49