IMPALA-4819

Populate min/max statistics in Parquet files for Timestamp values

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:

      Description

      Due to several issues in parquet-mr and subsequently Hive, IMPALA-3909 only adds write support for min/max statistics for numeric types. We should also add write support for Timestamp values. This is currently blocked by PARQUET-840 and is related to PARQUET-839.

          Activity

          Lars Volker added a comment:

          IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types

          This change adds functionality to write and read parquet::Statistics for
          Decimal, String, and Timestamp values. As an exception, we don't read
          statistics for CHAR columns, since CHAR support is broken in Impala
          (IMPALA-1652).
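
To illustrate what writing min/max statistics for Timestamp values involves, here is a minimal sketch of a running min/max accumulator. The `Timestamp` struct and the `MinMaxStats` class are hypothetical stand-ins invented for this example, not Impala's actual types. The sketch assumes a timestamp is ordered by its day number first and by nanoseconds within the day second.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical timestamp representation for illustration only:
// a day number plus nanoseconds within that day.
struct Timestamp {
  int32_t day;
  int64_t nanos_of_day;
};

// Order by day first, then by nanoseconds within the day.
inline bool operator<(const Timestamp& a, const Timestamp& b) {
  return std::tie(a.day, a.nanos_of_day) < std::tie(b.day, b.nanos_of_day);
}

// Hypothetical accumulator: tracks the smallest and largest value seen so far.
// T must be default-constructible and provide operator<.
template <typename T>
class MinMaxStats {
 public:
  void Update(const T& v) {
    if (!has_values_) {
      // The first value is both the minimum and the maximum.
      min_ = v;
      max_ = v;
      has_values_ = true;
      return;
    }
    if (v < min_) min_ = v;
    if (max_ < v) max_ = v;
  }

  // False until Update() has been called at least once; min() and max()
  // are only meaningful when this returns true.
  bool has_values() const { return has_values_; }
  const T& min() const { return min_; }
  const T& max() const { return max_; }

 private:
  bool has_values_ = false;
  T min_{};
  T max_{};
};
```

A writer along these lines would call `Update()` for every non-null value and, once done, encode `min()` and `max()` into the statistics only if `has_values()` is true.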

          This change also switches from populating the deprecated fields 'min'
          and 'max' to populating the new fields 'min_value' and 'max_value' in
          parquet::Statistics, which were added in parquet-format pull request
          #46.

          The HdfsParquetScanner prefers the new fields if they are populated
          and if the column order 'TypeDefinedOrder' has been used to compute
          the statistics. For columns without a column order set or with only
          the deprecated fields populated, the scanner reads statistics only if
          the column is of a simple numeric type, i.e. boolean, integer, or
          floating point.
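
The selection rule described above can be sketched roughly as follows. This is one reading of the description, not the scanner's actual code: `ColumnKind`, `StatsFields`, `StatsSource`, and `ChooseStatsSource` are hypothetical names invented for this example. It also assumes that min and max are populated together, and that the fallback for columns without a column order means reading the deprecated fields.

```cpp
#include <string>

// Hypothetical, simplified column type classification.
enum class ColumnKind {
  kBoolean, kInteger, kFloatingPoint, kDecimal, kString, kTimestamp
};

// Hypothetical stand-in for the statistics of one column chunk.
struct StatsFields {
  bool has_min_max = false;        // deprecated 'min' / 'max' are populated
  std::string min, max;
  bool has_min_max_value = false;  // new 'min_value' / 'max_value' are populated
  std::string min_value, max_value;
};

enum class StatsSource { kNone, kNewFields, kDeprecatedFields };

// Simple numeric types: boolean, integer, or floating point.
inline bool IsSimpleNumeric(ColumnKind kind) {
  return kind == ColumnKind::kBoolean || kind == ColumnKind::kInteger ||
         kind == ColumnKind::kFloatingPoint;
}

// Decides which pair of fields, if any, a reader should trust.
inline StatsSource ChooseStatsSource(const StatsFields& stats, ColumnKind kind,
                                     bool has_type_defined_order) {
  // Prefer the new fields, but only when the ordering used to compute
  // them is known to be 'TypeDefinedOrder'.
  if (stats.has_min_max_value && has_type_defined_order) {
    return StatsSource::kNewFields;
  }
  // Assumed fallback: use the deprecated fields for simple numeric types only.
  if (stats.has_min_max && IsSimpleNumeric(kind)) {
    return StatsSource::kDeprecatedFields;
  }
  return StatsSource::kNone;
}
```

Under this reading, a Timestamp column carrying only the deprecated fields yields `StatsSource::kNone`, so its statistics are ignored, while an integer column with the same fields still gets them read.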

          This change removes the validation of the Parquet Statistics we write to
          Hive from the tests, since Hive does not write the new fields. Instead
          it adds a parquet file written by Hive that uses the deprecated fields
          for its statistics. It uses that file to exercise the fallback logic for
          supported types in a test.

          This change also cleans up the interface of ParquetPlainEncoder in
          parquet-common.h.

          Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
          Reviewed-on: http://gerrit.cloudera.org:8080/6563
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Lars Volker <lv@cloudera.com>


            People

            • Assignee: Lars Volker
            • Reporter: Lars Volker
            • Votes: 0
            • Watchers: 3
