Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 0.13.1
- Fix Version: None
- Component: None
Description
Hive doesn't seem to be able to write NULL values in a column of type "struct". Instead, it replaces them with empty objects, i.e. non-NULL objects containing only NULL values.
Here is a short example demonstrating the issue. We start with a small Avro table "avro_table".
SELECT * FROM avro_table;

| mycol: struct&lt;field1:string,field2:double&gt; |
|---|
| {"field1":"blabla","field2":1.0} |
| {"field1":"blabla","field2":2.0} |
| NULL |
| {"field1":"blabla","field2":4.0} |
| {"field1":"blabla","field2":5.0} |
As you can see, the third row contains a NULL cell. Then we copy the table with Hive (INSERT OVERWRITE ...) into a Parquet table named "parquet_table".
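For reference, the copy step might look like the following HiveQL (a sketch; the exact DDL of the original tables is assumed):

```sql
-- Hypothetical repro: a Parquet table with the same struct column,
-- then an INSERT OVERWRITE copy from the Avro table.
CREATE TABLE parquet_table (
  mycol struct<field1:string, field2:double>
)
STORED AS PARQUET;

INSERT OVERWRITE TABLE parquet_table
SELECT mycol FROM avro_table;
```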
Finally, when you display the copy:
SELECT * FROM parquet_table;

| mycol: struct&lt;field1:string,field2:double&gt; |
|---|
| {"field1":"blabla","field2":1.0} |
| {"field1":"blabla","field2":2.0} |
| {"field1":null,"field2":null} |
| {"field1":"blabla","field2":4.0} |
| {"field1":"blabla","field2":5.0} |
I generated a (correct) Parquet file containing NULL structs using our own software (Dataiku), and Hive had no problem reading the NULL values, even when the column type was "struct". Consequently, I suspect the bug is located in Hive's Parquet writer code rather than in the reader.
This bug also propagates recursively to nested types. For instance, a NULL cell of type
struct<field1:struct<field3:string>,field2:double>
will become
{"field1":{"field3":null},"field2":null}
when written to a Parquet file.