SPARK-20359

Catalyst EliminateOuterJoin optimization can cause NPE

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0, 2.3.0
    • Component/s: SQL
    • Labels: None
    • Environment: spark master at commit 35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)

    Description

      we were running into an NPE in one of our UDFs for Spark SQL.

      now this particular function indeed could not handle nulls, but this was by design since null input was never allowed (and we would want it to blow up if there was a null as input).

      we realized the issue was not in our data when we added filters for nulls and the NPE still happened. then we also saw the NPE when just doing dataframe.explain instead of running our job.

      turns out the issue is in EliminateOuterJoin.canFilterOutNull, where a row with all nulls is fed into the expression as a test. it's the line:
      val v = boundE.eval(emptyRow)

      i believe it is a bug to assume the expression can always handle nulls.
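
      To illustrate the mechanism described above, here is a minimal pure-Scala sketch, not Spark's actual code. The names `NullProbeModel`, `Row`, `predicate` and `probeWithAllNulls` are hypothetical; only `firstCharBang` mirrors the function from this report. It models an optimizer that evaluates a predicate once against an all-null row, and shows that a null-intolerant function fails there even though no data row contains a null.

      ```scala
      import scala.util.Try

      object NullProbeModel {
        // Hypothetical stand-in for a row: column name -> value (null allowed).
        type Row = Map[String, String]

        // Same logic as the UDF in this report: throws a
        // NullPointerException when x is null.
        def firstCharBang(x: String): String = x.substring(0, 1) + "!"

        // Hypothetical predicate, loosely mirroring the filter in this report:
        // x1 is not null OR the derived value differs from "a!".
        def predicate(row: Row): Boolean =
          row("x1") != null || firstCharBang(row("x")) != "a!"

        // Model of the probe: build a row whose columns are all null and
        // evaluate the predicate against it, independent of the real data.
        // The outcome is captured in a Try so the failure can be inspected.
        def probeWithAllNulls(columns: Seq[String]): Try[Boolean] = {
          val emptyRow: Row = columns.map(_ -> (null: String)).toMap
          Try(predicate(emptyRow))
        }
      }
      ```

      In this model, `NullProbeModel.probeWithAllNulls(Seq("x", "x1"))` yields a `Failure` wrapping a `NullPointerException`, while `predicate` applied to a row with non-null values evaluates normally. That matches the observation that filtering nulls out of the data does not help.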

      for example this fails:

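      import org.apache.spark.sql.functions.udf
      import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

      // the udf below throws an NPE on null input by design
      val df1 = Seq("a", "b", "c").toDF("x")
        .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
      val df2 = Seq("a", "b").toDF("x1")
      df1
        .join(df2, df1("x") === df2("x1"), "left_outer")
        .filter($"x1".isNotNull || !$"y".isin("a!"))
        .count  // fails with the NPE while the plan is optimized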
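
      As a possible stopgap on affected versions, the function could be made null-tolerant so that the all-null probe no longer throws. This is only a sketch of a workaround, not a fix for the optimizer rule, and it gives up the fail-fast behaviour on null input that was intended here. The object name `NullTolerantUdf` is hypothetical.

      ```scala
      object NullTolerantUdf {
        // Null-tolerant variant of the function from this report:
        // returns null for null input instead of throwing.
        def firstCharBang(x: String): String =
          Option(x).map(s => s.substring(0, 1) + "!").orNull
      }
      ```

      With Spark on the classpath, this could be wrapped as `udf(NullTolerantUdf.firstCharBang _)` and used in place of the inline lambda above.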
      

            People

              Assignee: Koert Kuipers
              Reporter: koert kuipers