Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
Impala 4.0.0
Description
Parquet stats should be written to the Iceberg manifest as per-data-file metrics.
This task is specifically about the following metrics:
- column_sizes : Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats.
- null_value_counts : Map from column id to number of null values in the column.
- lower_bounds : Map from column id to lower bound in the column serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.
- upper_bounds : Map from column id to upper bound in the column serialized as binary. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
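As a rough illustration of how these four maps relate to Parquet statistics, the sketch below folds per-row-group column chunk stats into per-file metrics. It is not Impala's implementation: `ColumnChunkStats`, `DataFileMetrics`, and `collect_metrics` are hypothetical names, the bounds are kept as plain Python values (binary serialization is a separate step), and bounds are conservatively omitted for any column where a chunk has no min/max.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class ColumnChunkStats:
    """Hypothetical stats for one column chunk in one Parquet row group."""
    column_id: int          # Iceberg column (field) id
    size_bytes: int         # on-disk size of the chunk
    null_count: int
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None


@dataclass
class DataFileMetrics:
    """Hypothetical holder for the four per-file maps named in this task."""
    column_sizes: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    lower_bounds: Dict[int, Any] = field(default_factory=dict)
    upper_bounds: Dict[int, Any] = field(default_factory=dict)


def collect_metrics(chunks: Iterable[ColumnChunkStats]) -> DataFileMetrics:
    metrics = DataFileMetrics()
    missing_bounds = set()
    for c in chunks:
        cid = c.column_id
        # Sizes and null counts simply add up across row groups.
        metrics.column_sizes[cid] = metrics.column_sizes.get(cid, 0) + c.size_bytes
        metrics.null_value_counts[cid] = (
            metrics.null_value_counts.get(cid, 0) + c.null_count
        )
        if c.min_value is None or c.max_value is None:
            # Assumption: without min/max for a chunk we cannot guarantee
            # a valid file-level bound, so the column gets no bounds.
            missing_bounds.add(cid)
            continue
        metrics.lower_bounds[cid] = min(
            metrics.lower_bounds.get(cid, c.min_value), c.min_value
        )
        metrics.upper_bounds[cid] = max(
            metrics.upper_bounds.get(cid, c.max_value), c.max_value
        )
    for cid in missing_bounds:
        metrics.lower_bounds.pop(cid, None)
        metrics.upper_bounds.pop(cid, None)
    return metrics
```

Dropping bounds when any chunk lacks statistics is a deliberately cautious choice in this sketch: an absent bound is always safe for readers, while a bound that is too tight could cause a file to be skipped incorrectly.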
Iceberg manifest doc:
https://iceberg.apache.org/spec/#manifests
lower_bounds and upper_bounds values should be serialized to binary using Iceberg's single-value serialization:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
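For orientation, here is a minimal sketch of the binary single-value encoding for a few primitive types, following the rules in the appendix linked above (little-endian fixed-width numbers, UTF-8 strings without a length prefix, a single byte for booleans). The function name `serialize_single_value` and the string type tags are hypothetical, and types such as decimal, date/time, and uuid are deliberately left out.

```python
import struct


def serialize_single_value(iceberg_type: str, value) -> bytes:
    """Encode one bound value as bytes (illustrative subset of types only)."""
    if iceberg_type == "boolean":
        # 0x00 for false, a non-zero byte for true.
        return b"\x01" if value else b"\x00"
    if iceberg_type == "int":
        return struct.pack("<i", value)   # 4-byte little-endian
    if iceberg_type == "long":
        return struct.pack("<q", value)   # 8-byte little-endian
    if iceberg_type == "float":
        return struct.pack("<f", value)   # 4-byte little-endian IEEE 754
    if iceberg_type == "double":
        return struct.pack("<d", value)   # 8-byte little-endian IEEE 754
    if iceberg_type == "string":
        return value.encode("utf-8")      # UTF-8 bytes, no length prefix
    raise ValueError(f"type not covered by this sketch: {iceberg_type}")
```

For example, `serialize_single_value("int", 34)` yields the four bytes `22 00 00 00` in hex, which is what would be stored as the map value for that column id in lower_bounds or upper_bounds.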