Description
The HiveTableScan operator unwraps the string "NULL" (case-insensitive) into a null value even when the column type is STRING.
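A minimal sketch (hypothetical, not the actual Spark source) of the kind of unwrapping that would produce this symptom, where the scan treats the cell text "NULL" as a null marker even for STRING columns instead of keeping it verbatim:

// buggyUnwrap is an illustrative helper, not a Spark API.
def buggyUnwrap(raw: String): Any =
  if (raw != null && raw.equalsIgnoreCase("NULL")) null  // bug: "NULL"/"null" is a legal STRING value
  else raw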
To reproduce the bug, we use sql/hive/src/test/resources/groupby_groupingid.txt as test input, copied to /tmp/groupby_groupingid.txt.
Hive session:
hive> CREATE TABLE test_null(key INT, value STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table test_null;
hive> SELECT * FROM test_null WHERE value IS NOT NULL;
...
OK
1 NULL
1 1
2 2
3 3
3 NULL
4 5
We can see that the NULL cells in the original input file are interpreted as the string "NULL" in Hive, so all six rows (including the two whose value is "NULL") pass the IS NOT NULL filter.
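For context, Hive's text SerDe marks on-disk nulls with the sequence "\N" (the serialization.null.format table property) by default, so the literal text NULL in the data file is just a four-character string. A toy parser mirroring that rule (parseTextCell is a hypothetical helper, not a Hive API):

// Only the configured null sequence ("\N" by default) maps to null;
// the text "NULL" survives as an ordinary string.
def parseTextCell(raw: String, nullFormat: String = "\\N"): Option[String] =
  if (raw == nullFormat) None else Some(raw)

assert(parseTextCell("NULL") == Some("NULL"))
assert(parseTextCell("\\N").isEmpty)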
Spark SQL session (sbt/sbt hive/console):
scala> hql("CREATE TABLE test_null(key INT, value STRING)")
scala> hql("LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table test_null")
scala> hql("SELECT * FROM test_null WHERE value IS NOT NULL").foreach(println)
...
[1,1]
[2,2]
[3,3]
[4,5]
As we can see, the string "NULL" is interpreted as a null value in Spark SQL, so the two rows whose value is "NULL" are incorrectly filtered out by IS NOT NULL.
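A quick way to observe the discrepancy from the same hive/console session (a sketch; the expected count of 6 comes from the Hive output above):

scala> hql("SELECT COUNT(*) FROM test_null WHERE value IS NOT NULL").collect().foreach(println)
// Hive semantics would give [6]; with this bug, Spark SQL gives [4].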
Issue Links
- is related to SPARK-3683: PySpark Hive query generates "NULL" instead of None (Closed)