Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18859

Catalyst codegen does not mark column as nullable when it should. Causes NPE

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.0.2
    • None
    • Optimizer, SQL
    • Important

    Description

      When joining two tables via LEFT JOIN, columns in right table may be NULLs, however catalyst codegen cannot recognize it.
      Example:

      schema.sql
      create table masterdata.testtable(
        id int not null,
        age int
      );
      create table masterdata.jointable(
        id int not null,
        name text not null
      );
      
      
      query_to_select.sql
      (select t.id, t.age, j.name from masterdata.testtable t left join masterdata.jointable j on t.id = j.id) as testtable;
      
      master code
      val df = sqlContext
            .read
            .format("jdbc")
            .option("dbTable", "query to select")
            ....
            .load
      //df generated schema
      /*
      root
       |-- id: integer (nullable = false)
       |-- age: integer (nullable = true)
       |-- name: string (nullable = false)
      */
      
      Codegen
      /* 038 */       scan_rowWriter.write(0, scan_value);
      /* 039 */
      /* 040 */       if (scan_isNull1) {
      /* 041 */         scan_rowWriter.setNullAt(1);
      /* 042 */       } else {
      /* 043 */         scan_rowWriter.write(1, scan_value1);
      /* 044 */       }
      /* 045 */
      /* 046 */       scan_rowWriter.write(2, scan_value2);
      

      Since j.name is from right table of left join query, it may be null. However generated schema doesn't think so (probably because it defined as name text not null)

      StackTrace
      java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
      	at org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146)
      	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
      	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
      	at org.apache.spark.scheduler.Task.run(Task.scala:86)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            mentegy Mykhailo Osypov
            Votes:
            2 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: