Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.0.2
-
None
-
Important
Description
When joining two tables via LEFT JOIN, columns in right table may be NULLs, however catalyst codegen cannot recognize it.
Example:
schema.sql
create table masterdata.testtable( id int not null, age int ); create table masterdata.jointable( id int not null, name text not null );
query_to_select.sql
(select t.id, t.age, j.name from masterdata.testtable t left join masterdata.jointable j on t.id = j.id) as testtable;
master code
val df = sqlContext .read .format("jdbc") .option("dbTable", "query to select") .... .load //df generated schema /* root |-- id: integer (nullable = false) |-- age: integer (nullable = true) |-- name: string (nullable = false) */
Codegen
/* 038 */ scan_rowWriter.write(0, scan_value); /* 039 */ /* 040 */ if (scan_isNull1) { /* 041 */ scan_rowWriter.setNullAt(1); /* 042 */ } else { /* 043 */ scan_rowWriter.write(1, scan_value1); /* 044 */ } /* 045 */ /* 046 */ scan_rowWriter.write(2, scan_value2);
Since j.name is from right table of left join query, it may be null. However generated schema doesn't think so (probably because it defined as name text not null)
StackTrace
java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)