Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I'm currently working on adding support in Impala for writing min/max statistics to Parquet files (IMPALA-3909). I noticed that the comments in parquet.thrift#L201 don't seem to match the implementations in parquet-mr and Hive.
The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), this should be "4 byte length stored as little endian, followed by bytes".
Looking at BinaryStatistics.java#L61, it seems to return the bytes without a length-prefix. Writing a parquet file with Hive also shows this behavior.
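To make the difference concrete, here is a minimal Python sketch of the two byte layouts for a BYTE_ARRAY min/max value. The function names are hypothetical and only illustrate the layouts; they are not taken from parquet-mr, Hive, or Impala.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """PLAIN encoding as described in the comments: a 4-byte
    little-endian length, followed by the bytes themselves."""
    return struct.pack("<I", len(value)) + value


def raw_byte_array(value: bytes) -> bytes:
    """What the existing writers appear to emit: the bytes only,
    with no length prefix."""
    return value


if __name__ == "__main__":
    value = b"abc"
    print(plain_encode_byte_array(value).hex())  # 03000000616263
    print(raw_byte_array(value).hex())           # 616263
```

A reader that expects one layout and is handed the other would misinterpret the first four bytes, so the comment and the implementations need to agree on one of them.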
Similarly, though the case is less ambiguous, PLAIN encoding for booleans uses bit-packing. For a single bit (the min or max of a boolean column), this seems to imply setting the least significant bit of a single byte. This could be stated more clearly in the parquet.thrift file, too.
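As a sketch of the boolean case, assuming that bit-packing fills each byte starting from the least significant bit, a single boolean min/max value would occupy one byte with only bit 0 used. The helper name below is hypothetical.

```python
def plain_encode_booleans(values):
    """Bit-pack booleans, least significant bit first.
    Unused high bits in the final byte are left as zero."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


if __name__ == "__main__":
    # A single min or max value is a one-element sequence.
    print(plain_encode_booleans([True]).hex())   # 01
    print(plain_encode_booleans([False]).hex())  # 00
```

Under this reading, a boolean min/max is always exactly one byte, either 0x00 or 0x01.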