SPARK-28481: More expressions should extend NullIntolerant


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

      Description

      SPARK-13995 introduced the NullIntolerant trait to generalize the logic for inferring IsNotNull constraints from expressions. An expression is null-intolerant if it returns null when any of its input expressions are null.
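
      As a rough sketch of the idea (toy types below, not the actual Catalyst classes): NullIntolerant acts as a marker trait, and constraint inference can walk a predicate collecting the attributes that are reachable through unbroken chains of null-intolerant expressions.

      {code:scala}
      object NullIntolerantSketch {
        trait Expr { def children: Seq[Expr] }
        trait NullIntolerant extends Expr
        case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }

        // Any attribute reachable only through null-intolerant nodes must be
        // non-null for the enclosing predicate to evaluate to a non-null value.
        def inferNonNullAttrs(predicate: Expr): Seq[Attr] = predicate match {
          case a: Attr            => Seq(a)
          case ni: NullIntolerant => ni.children.flatMap(inferNonNullAttrs)
          case _                  => Nil // null-tolerant node: can't conclude anything
        }
      }
      {code}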

      I've noticed that most expressions are null-intolerant: anything which extends UnaryExpression / BinaryExpression and keeps the default eval method will be null-intolerant. However, only a subset of these expressions mix in the NullIntolerant trait. As a result, we're missing out on the opportunity to infer certain types of non-null constraints: for example, if we see a WHERE length(x) > 10 condition then we know that the column x must be non-null and can push this non-null filter down to our datasource scan.
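
      For instance (illustrative spark-shell snippet; the table, column, and plan text below are hypothetical):

      {code:scala}
      // If length() is known to be null-intolerant, the optimizer can infer
      // IsNotNull(x) from the predicate and push it down to the data source.
      spark.sql("SELECT * FROM t WHERE length(x) > 10").explain()
      // == Physical Plan ==
      // *(1) Filter (isnotnull(x#0) AND (length(x#0) > 10))
      // +- FileScan parquet [x#0] ... PushedFilters: [IsNotNull(x)]
      {code}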

      I can think of a few ways to fix this:

      1. Modify every relevant expression to mix in the NullIntolerant trait. We can use IDEs or other code-analysis tools (e.g. ClassUtil plus reflection) to help automate the process of identifying expressions which do not override the default eval. (A sketch of the resulting mix-in style follows this list.)
      2. Make a backwards-incompatible change to our abstract base class hierarchy to add NullSafe*aryExpression abstract base classes which define the nullSafeEval method and implement a final eval method, then leave eval unimplemented in the regular *aryExpression base classes. (A simplified sketch of this hierarchy also follows the list.)
        • This would fix the somewhat weird code smell that we have today where nullSafeEval has a default implementation which calls sys.error.
        • This would negatively impact users who have implemented custom Catalyst expressions.
      3. Use runtime reflection to determine whether expressions are null-intolerant by virtue of using one of the default null-intolerant eval implementations. We can then use this in an isNullIntolerant helper method which checks that classes either (a) extend NullIntolerant or (b) are null-intolerant according to the reflective check (which is basically just figuring out which concrete implementation the eval method resolves to). (A sketch of this helper follows the list as well.)
        • We only need to perform the reflection once per class and can cache the result for the lifetime of the JVM, so the performance overhead would be pretty small (especially compared to other non-cacheable reflection / traversal costs in Catalyst).
        • The downside is additional complexity in the code which pattern-matches / checks for null-intolerance.
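
      A minimal sketch of option 1 (the StringCharLength expression is hypothetical, and the code is written against Spark 3.1-era Catalyst internals; newer versions additionally require withNewChildInternal):

      {code:scala}
      import org.apache.spark.sql.catalyst.expressions.{Expression, NullIntolerant, UnaryExpression}
      import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
      import org.apache.spark.sql.types.{DataType, IntegerType}
      import org.apache.spark.unsafe.types.UTF8String

      // Keeps UnaryExpression's default eval (which already returns null on
      // null input), so declaring null-intolerance is just a one-line mix-in.
      case class StringCharLength(child: Expression)
        extends UnaryExpression with NullIntolerant {
        override def dataType: DataType = IntegerType
        override protected def nullSafeEval(input: Any): Any =
          input.asInstanceOf[UTF8String].numChars()
        override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
          defineCodeGen(ctx, ev, c => s"($c).numChars()")
      }
      {code}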
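
      A simplified sketch of the option 2 hierarchy (toy stand-ins, not a drop-in change to the real base classes):

      {code:scala}
      // eval loses its default body in the plain base class, and the NullSafe
      // variant makes it final, so null-intolerance becomes a property of the
      // type rather than an opt-in marker.
      abstract class ToyExpression {
        def eval(input: Any): Any
      }

      abstract class ToyUnaryExpression extends ToyExpression {
        def child: ToyExpression
        // No default eval here: subclasses must pick a base class that states
        // their null behaviour explicitly.
      }

      abstract class NullSafeUnaryExpression extends ToyUnaryExpression {
        // No sys.error placeholder any more: nullSafeEval is simply abstract.
        protected def nullSafeEval(value: Any): Any
        final override def eval(input: Any): Any = {
          val v = child.eval(input)
          if (v == null) null else nullSafeEval(v)
        }
      }
      {code}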
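
      And a sketch of the option 3 helper (the NullIntolerance object and its method are made up for illustration; per-class results are cached for the lifetime of the JVM):

      {code:scala}
      import java.util.concurrent.ConcurrentHashMap

      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, NullIntolerant, TernaryExpression, UnaryExpression}

      object NullIntolerance {
        // Base classes whose default eval short-circuits to null on null input.
        private val defaultEvalOwners: Set[Class[_]] = Set(
          classOf[UnaryExpression], classOf[BinaryExpression], classOf[TernaryExpression])

        private val cache = new ConcurrentHashMap[Class[_], java.lang.Boolean]()

        def isNullIntolerant(e: Expression): Boolean = {
          val cls = e.getClass
          var result = cache.get(cls)
          if (result == null) {
            // getMethod resolves to the most-derived override, so its declaring
            // class tells us whether eval was inherited from a default base class.
            result = e.isInstanceOf[NullIntolerant] ||
              defaultEvalOwners.contains(
                cls.getMethod("eval", classOf[InternalRow]).getDeclaringClass)
            cache.put(cls, result)
          }
          result
        }
      }
      {code}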

      Of these approaches, I'm currently leaning towards option 1 (semi-automated identification and manual updating of hundreds of expressions): if we go with that approach then we can perform a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we add new expressions), I'd propose adding an "is this null-intolerant?" item to the checklist used when reviewing PRs which add new Catalyst expressions.

      /cc Takeshi Yamamuro, L. C. Hsieh


            People

            • Assignee: yumwang (Yuming Wang)
            • Reporter: joshrosen (Josh Rosen)
            • Votes: 0
            • Watchers: 6
