IMPALA-10879

Add Parquet stats to Iceberg manifest



    Description

      Parquet statistics should be written to the Iceberg manifest as per-data-file metrics.

      This task is specifically about the following metrics:

      • column_sizes: Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats.
      • null_value_counts: Map from column id to the number of null values in the column.
      • lower_bounds: Map from column id to the lower bound in the column, serialized as binary. Each value must be less than or equal to all non-null, non-NaN values in the column for the file.
      • upper_bounds: Map from column id to the upper bound in the column, serialized as binary. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file.
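
      As a rough illustration of the four maps above, the following Python sketch merges per-column-chunk statistics into per-file metrics. It is not Impala's implementation: ChunkStats, DataFileMetrics, merge_chunk_stats and the to_binary callback are hypothetical names introduced here, and the sketch assumes integer-valued columns and that each chunk already exposes its on-disk size, null count, and min/max.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class ChunkStats:
    """Hypothetical stats for one column chunk of one Parquet row group."""
    column_id: int
    size_bytes: int
    null_count: int
    min_value: Optional[int] = None  # None when the chunk holds only nulls
    max_value: Optional[int] = None


@dataclass
class DataFileMetrics:
    """The four per-data-file maps named in this task, keyed by column id."""
    column_sizes: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    lower_bounds: Dict[int, bytes] = field(default_factory=dict)
    upper_bounds: Dict[int, bytes] = field(default_factory=dict)


def merge_chunk_stats(
    chunks: Iterable[ChunkStats],
    to_binary: Callable[[int], bytes],
) -> DataFileMetrics:
    """Fold chunk-level stats into file-level metrics.

    Sizes and null counts are summed; bounds are taken as the min/max of the
    typed values first and only then serialized with `to_binary`.
    """
    sizes: Dict[int, int] = {}
    nulls: Dict[int, int] = {}
    mins: Dict[int, int] = {}
    maxs: Dict[int, int] = {}
    for c in chunks:
        cid = c.column_id
        sizes[cid] = sizes.get(cid, 0) + c.size_bytes
        nulls[cid] = nulls.get(cid, 0) + c.null_count
        if c.min_value is not None:
            mins[cid] = min(mins.get(cid, c.min_value), c.min_value)
        if c.max_value is not None:
            maxs[cid] = max(maxs.get(cid, c.max_value), c.max_value)
    return DataFileMetrics(
        column_sizes=sizes,
        null_value_counts=nulls,
        lower_bounds={k: to_binary(v) for k, v in mins.items()},
        upper_bounds={k: to_binary(v) for k, v in maxs.items()},
    )
```

      In this sketch a column whose chunks contain only nulls gets entries in column_sizes and null_value_counts but none in lower_bounds or upper_bounds.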

      Iceberg manifest doc:
      https://iceberg.apache.org/spec/#manifests

      lower_bounds and upper_bounds values should be serialized to binary using Iceberg's single-value serialization:
      https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
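
      For orientation, here is a minimal Python sketch of single-value serialization for a handful of primitive types, based on the encodings described in the linked appendix (little-endian fixed-width numbers, UTF-8 strings without a length prefix). The function name serialize_bound and the string type names are assumptions made for this example, and types such as decimal, date, and timestamp are deliberately left out.

```python
import struct


def serialize_bound(type_name: str, value) -> bytes:
    """Serialize one bound value to binary (sketch, subset of types only)."""
    if type_name == "boolean":
        # A single byte: 0x00 for false, non-zero for true.
        return b"\x01" if value else b"\x00"
    if type_name == "int":
        return struct.pack("<i", value)   # 4-byte little-endian
    if type_name == "long":
        return struct.pack("<q", value)   # 8-byte little-endian
    if type_name == "float":
        return struct.pack("<f", value)   # 4-byte little-endian IEEE 754
    if type_name == "double":
        return struct.pack("<d", value)   # 8-byte little-endian IEEE 754
    if type_name == "string":
        return value.encode("utf-8")      # UTF-8 bytes, no length prefix
    raise ValueError(f"type not covered by this sketch: {type_name}")
```

      A caller would apply such a function to each column's minimum and maximum after they have been computed on the typed values, since comparing the serialized little-endian bytes does not preserve numeric order.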


          People

            attilaj Attila Jeges