  1. Spark
  2. SPARK-28478

Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))



    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None


      I ran across a family of expressions like

      if(x is null, x, substring(x, 0, 1024))


      when($"x".isNull, $"x").otherwise(substring($"x", 0, 1024))

      that were written this way because the query author was unsure about whether substring would return null when its input string argument is null.

      This explicit null-handling is unnecessary and adds bloat to the generated code, especially if it's done via a CASE statement (which compiles down to a do-while loop).

      In another case I saw a query compiler which automatically generated this type of code.

      It would be cool if Spark could automatically optimize such queries to remove these redundant null checks. Here's a sketch of what such a rule might look like (assuming that SPARK-28477 has been implemented, so we only need to worry about the IF case):

      • In the pattern match, check the following three conditions in the following order (to benefit from short-circuiting):
        • The IF condition is an explicit null-check of a column c
        • The true expression returns either c or null
        • The false expression is a null-intolerant expression with c as a direct child
      • If all three conditions match, replace the entire If with the false branch's expression.
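
      The conditions above can be illustrated with a minimal, self-contained sketch. It is not Catalyst code: the Expr tree, the isNullIntolerant helper and the RemoveRedundantNullCheck object are all hypothetical stand-ins invented for this example, and the rule only inspects the top-level expression rather than traversing a whole plan.

```scala
// Toy expression tree. These types are hypothetical stand-ins for
// illustration only; they are not Spark's Catalyst classes.
sealed trait Expr { def children: Seq[Expr] }
case class Column(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Literal(value: Any) extends Expr { val children: Seq[Expr] = Nil }
case class IsNull(child: Expr) extends Expr { val children: Seq[Expr] = Seq(child) }
case class If(predicate: Expr, trueValue: Expr, falseValue: Expr) extends Expr {
  val children: Seq[Expr] = Seq(predicate, trueValue, falseValue)
}
// Assumed to be null-intolerant: yields null whenever any child is null.
case class Substring(str: Expr, pos: Expr, len: Expr) extends Expr {
  val children: Seq[Expr] = Seq(str, pos, len)
}

object RemoveRedundantNullCheck {
  // Hypothetical helper: in this toy model only Substring is null-intolerant.
  private def isNullIntolerant(e: Expr): Boolean = e match {
    case _: Substring => true
    case _            => false
  }

  // Rewrites only the top-level expression; a real rule would be applied
  // to every node of the expression tree.
  def apply(e: Expr): Expr = e match {
    // 1. The condition is an explicit null check of a column c.
    case If(IsNull(c: Column), t, f)
        // 2. The true branch returns c itself or a null literal.
        if (t == c || t == Literal(null)) &&
          // 3. The false branch is null-intolerant with c as a direct child.
          isNullIntolerant(f) && f.children.contains(c) =>
      f
    case other => other
  }
}
```

      With these toy types, If(IsNull(x), x, Substring(x, Literal(0), Literal(1024))) for x = Column("x") is rewritten to the bare Substring, while an If whose true branch returns some other value, or whose null check is on a different column, is left unchanged.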



              Assignee: Unassigned
              Reporter: Josh Rosen (joshrosen)