Spark / SPARK-40963

ExtractGenerator sets incorrect nullability in new Project

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.3, 3.2.2, 3.3.1, 3.4.0
    • Fix Version/s: 3.2.3, 3.3.2, 3.4.0
    • Component/s: SQL

    Description

      Example:

      select c1, explode(c4) as c5 from (
        select c1, array(c3) as c4 from (
          select c1, explode_outer(c2) as c3
          from values
          (1, array(1, 2)),
          (2, array(2, 3)),
          (3, null)
          as data(c1, c2)
        )
      );
      
      +---+---+
      |c1 |c5 |
      +---+---+
      |1  |1  |
      |1  |2  |
      |2  |2  |
      |2  |3  |
      |3  |0  |
      +---+---+
      

      In the last row, c5 is 0, but should be NULL.

      Another example:

      select c1, exists(c4, x -> x is null) as c5 from (
        select c1, array(c3) as c4 from (
          select c1, explode_outer(c2) as c3
          from values
          (1, array(1, 2)),
          (2, array(2, 3)),
          (3, null)
          as data(c1, c2)
        )
      );
      
      +---+-----+
      |c1 |c5   |
      +---+-----+
      |1  |false|
      |1  |false|
      |2  |false|
      |2  |false|
      |3  |false|
      +---+-----+
      

      In the last row, false should be true.

      In both cases, at the time CreateArray(c3) is instantiated, c3's nullability is incorrect because the new projection created by ExtractGenerator uses generatorOutput from explode_outer(c2) as a projection list. generatorOutput doesn't take into account that explode_outer(c2) is an outer explode, so the nullability setting is lost.

      UpdateAttributeNullability will eventually fix the nullable setting for attributes referring to c3, but it doesn't fix the containsNull setting for c4 in explode(c4) (from the first example) or exists(c4, x -> x is null) (from the second example).
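
      The propagation described above can be sketched with a small toy model. This is plain Python, not Catalyst code: the classes Attribute and ArrayType, the helper functions, and the account_for_outer flag are all hypothetical names invented for illustration, and the "fixed" path only assumes that a correct projection would mark the output of an outer generator as nullable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for a Catalyst attribute: a name plus a nullable flag."""
    name: str
    nullable: bool


@dataclass(frozen=True)
class ArrayType:
    """Toy stand-in for an array type; contains_null mirrors containsNull."""
    contains_null: bool


def generator_output(element_nullable: bool) -> Attribute:
    # Per the description: the generator's own output reflects only the
    # element nullability of the exploded array, not whether it is "outer".
    return Attribute("c3", nullable=element_nullable)


def project_generator(element_nullable: bool, outer: bool,
                      account_for_outer: bool) -> Attribute:
    # account_for_outer is a hypothetical switch between the buggy behavior
    # (False) and an assumed corrected behavior (True).
    attr = generator_output(element_nullable)
    if outer and account_for_outer:
        # explode_outer produces NULL when its input array is NULL or empty.
        attr = replace(attr, nullable=True)
    return attr


def create_array(child: Attribute) -> ArrayType:
    # The array type records the child's nullability at construction time.
    return ArrayType(contains_null=child.nullable)


# array(1, 2) holds no NULL elements, so the element type is non-nullable.
buggy_c3 = project_generator(element_nullable=False, outer=True,
                             account_for_outer=False)
buggy_c4 = create_array(buggy_c3)

# A later pass repairs the attribute itself, but the array type that was
# already derived from it keeps the stale flag.
repaired_c3 = replace(buggy_c3, nullable=True)

fixed_c3 = project_generator(element_nullable=False, outer=True,
                             account_for_outer=True)
fixed_c4 = create_array(fixed_c3)

print(buggy_c3.nullable, buggy_c4.contains_null)     # False False
print(repaired_c3.nullable, buggy_c4.contains_null)  # True False
print(fixed_c3.nullable, fixed_c4.contains_null)     # True True
```

      The middle line is the situation reported here: c3 ends up nullable, yet the type of c4 still claims it contains no NULLs, so consumers of c4 such as explode, exists, and inline_outer work from the wrong assumption.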

      This example fails with a NullPointerException:

      select c1, inline_outer(c4) from (
        select c1, array(c3) as c4 from (
          select c1, explode_outer(c2) as c3
          from values
          (1, array(named_struct('a', 1, 'b', 2))),
          (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
          (3, null)
          as data(c1, c2)
        )
      );
      
      22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
      java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
      

          People

            Assignee: Bruce Robbins (bersprockets)
            Reporter: Bruce Robbins (bersprockets)