[HIVE-8419] Hive doesn't properly write NULL values in Parquet files when the type is struct<...>. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.13.1
Fix Version/s: None
Component/s: File Formats
Labels: None

Description

Hive doesn't seem to be able to write NULL values in a column of type "struct". Instead, it replaces them with empty objects (that is, non-NULL structs whose fields are all NULL).

Here is a short example demonstrating the issue. We start with a small Avro table "avro_table".

 SELECT * FROM avro_table

mycol
struct<field1:string,field2:double>
{"field1":"blabla","field2":1.0}
{"field1":"blabla","field2":2.0}
NULL
{"field1":"blabla","field2":4.0}
{"field1":"blabla","field2":5.0}

The third row contains a NULL cell. Next, copy the table with Hive (INSERT OVERWRITE ...) into a Parquet table named "parquet_table".

Finally, querying the Parquet table shows that the NULL cell has become a struct whose fields are all NULL:

 SELECT * FROM parquet_table

mycol
struct<field1:string,field2:double>
{"field1":"blabla","field2":1.0}
{"field1":"blabla","field2":2.0}
{"field1":null,"field2":null}
{"field1":"blabla","field2":4.0}
{"field1":"blabla","field2":5.0}
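The difference between the two result sets can be checked mechanically. The sketch below is only an illustration: it assumes each cell is printed either as the literal NULL or as a JSON object, as in the listings above, and the helper names (parse_cell, is_all_null, nulls_turned_into_structs) are hypothetical, not part of Hive.

```python
import json


def parse_cell(text):
    """Parse one printed cell: the literal NULL, or a JSON object."""
    text = text.strip()
    return None if text == "NULL" else json.loads(text)


def is_all_null(value):
    """True for a struct whose fields are, recursively, all null."""
    if isinstance(value, dict):
        return all(v is None or is_all_null(v) for v in value.values())
    return False


def nulls_turned_into_structs(source_rows, copied_rows):
    """Return 0-based indexes of rows that were NULL in the source
    but came back as all-null structs in the copy."""
    return [
        i
        for i, (src, dst) in enumerate(zip(source_rows, copied_rows))
        if parse_cell(src) is None and is_all_null(parse_cell(dst))
    ]


# Rows copied from the two listings above.
avro_rows = [
    '{"field1":"blabla","field2":1.0}',
    '{"field1":"blabla","field2":2.0}',
    'NULL',
    '{"field1":"blabla","field2":4.0}',
    '{"field1":"blabla","field2":5.0}',
]
parquet_rows = [
    '{"field1":"blabla","field2":1.0}',
    '{"field1":"blabla","field2":2.0}',
    '{"field1":null,"field2":null}',
    '{"field1":"blabla","field2":4.0}',
    '{"field1":"blabla","field2":5.0}',
]

print(nulls_turned_into_structs(avro_rows, parquet_rows))  # [2]
```

Index 2 is the third row, the one that was NULL in "avro_table".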

I tried to generate a (correct) Parquet file using our software (Dataiku), and Hive had no problem reading null values, even when the column type was "struct".

Consequently, I suspect the bug is in Hive's Parquet writer code rather than in the reader.

This bug also propagates recursively to nested types. For instance, a NULL cell of type

 struct<field1:struct<field3:string>,field2:double>

becomes

 {"field1":{"field3":null},"field2":null}

when written in a Parquet file.
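One way to picture the distinction a correct writer has to preserve: Parquet records, for each leaf column, a definition level that counts how many optional fields along the path are actually present. A NULL struct and a struct of NULL fields therefore get different levels. The sketch below models this in plain Python. It is not Hive's writer code; it assumes every field (including the column itself) is optional, and the schema encoding and the function names are hypothetical.

```python
# Hypothetical schema model: a dict is a struct, None marks a primitive leaf.
FLAT = {"field1": None, "field2": None}
NESTED = {"field1": {"field3": None}, "field2": None}


def leaves(schema, path):
    """List the dotted paths of all primitive leaves under `path`."""
    if schema is None:
        return [path]
    out = []
    for name, sub in schema.items():
        out.extend(leaves(sub, f"{path}.{name}"))
    return out


def definition_levels(value, schema, path="mycol", level=0):
    """Map each leaf path to its definition level for one cell.

    Assumes every node on the path is optional, so each node that is
    present (not None) adds one to the level.
    """
    if value is None:
        # Nothing below this point is defined: all leaves stop here.
        return {leaf: level for leaf in leaves(schema, path)}
    level += 1
    if schema is None:
        return {path: level}
    result = {}
    for name, sub in schema.items():
        result.update(
            definition_levels(value.get(name), sub, f"{path}.{name}", level)
        )
    return result


# A NULL struct: no optional node on the path is defined.
print(definition_levels(None, FLAT))
# {'mycol.field1': 0, 'mycol.field2': 0}

# What ends up in the Parquet table: the struct exists, its fields do not.
print(definition_levels({"field1": None, "field2": None}, FLAT))
# {'mycol.field1': 1, 'mycol.field2': 1}

# Nested case from the description.
print(definition_levels(None, NESTED))
# {'mycol.field1.field3': 0, 'mycol.field2': 0}
print(definition_levels({"field1": {"field3": None}, "field2": None}, NESTED))
# {'mycol.field1.field3': 2, 'mycol.field2': 1}
```

Under this model, the observed output is consistent with the writer treating the NULL struct as present at each nesting level, instead of stopping at level 0.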

People

Assignee: Sergio Peña

Reporter: Frédéric TERRAZZONI

Votes: 0

Watchers: 3

Dates

Created: 09/Oct/14 16:18

Updated: 06/Dec/14 00:28

Resolved: 06/Dec/14 00:28