IMPALA-10629

bin/load-data.py does not respect compression codec for parquet

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 4.0.0
    • Fix Version/s: Impala 4.0.0
    • Component/s: Infrastructure
    • Labels: None

    Description

      If I try to use bin/load-data.py to load TPC-H as ZSTD-compressed Parquet, it silently ignores the codec and uses Snappy under the covers:

      $ bin/load-data.py -w tpch --table_formats=parquet/zstd
      $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
      Found 4 items
      -rw-r--r--   3 joe supergroup   72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000000_1779607968_data.0.parq
      -rw-r--r--   3 joe supergroup   58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000001_53336944_data.0.parq
      -rw-r--r--   3 joe supergroup   72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
      drwxr-xr-x   - joe supergroup          0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
      $ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
      $ parquet-reader 02444051906c734d-3b49d6c900000002_53336944_data.0.parq
      ...
              [10] = ColumnChunk {
                02: file_offset (i64) = 37053592,
                03: meta_data (struct) = ColumnMetaData {
                  01: type (i32) = 6,
                  02: encodings (list) = list<i32>[2] {
                    [0] = 2,
                    [1] = 3,
                  },
                  03: path_in_schema (list) = list<string>[1] {
                    [0] = "l_shipdate",
                  },
                  04: codec (i32) = 1, <------ SNAPPY!!!!
      
      ...
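
The numeric fields in the dump above can be decoded by hand. The following is a minimal sketch, not part of Impala or parquet-reader: it maps the `codec` integer to a name using the CompressionCodec enum values from the Parquet Thrift definition. The helper names `codec_name` and `check_codec` are hypothetical and exist only for illustration.

```python
# Enum values taken from the CompressionCodec enum in the Parquet
# Thrift definition (parquet.thrift).
COMPRESSION_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}


def codec_name(codec_id):
    """Return the codec name for a numeric ColumnMetaData.codec value."""
    return COMPRESSION_CODECS.get(codec_id, "UNKNOWN({})".format(codec_id))


def check_codec(codec_id, expected):
    """Hypothetical helper: True if the column chunk uses the expected codec."""
    return codec_name(codec_id) == expected.upper()


if __name__ == "__main__":
    # The dump shows codec = 1 for the l_shipdate column chunk.
    print(codec_name(1))
    print(check_codec(1, "zstd"))
```

A file that had really been written with ZSTD would report `codec (i32) = 6` for its column chunks.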

      From what I can see, bin/load-data.py doesn't set the compression_codec query option when loading Parquet, so the load falls back to the default codec (Snappy). Silently doing the wrong thing is a bug; actually supporting other codecs is more of a feature request.
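
One possible shape for a fix is sketched below. This is not the actual bin/load-data.py code or the committed change: the function name `compression_codec_stmt`, the `CODEC_OPTION_VALUES` mapping, and the way the table format string is parsed are all assumptions made for illustration. It derives a `SET COMPRESSION_CODEC=...` statement from a table format such as `parquet/zstd`, and raises on an unrecognized codec rather than silently falling back to the default.

```python
# Assumed mapping from the codec part of a table format string to a value
# for Impala's COMPRESSION_CODEC query option. Illustrative, not exhaustive.
CODEC_OPTION_VALUES = {
    "none": "NONE",
    "snap": "SNAPPY",
    "snappy": "SNAPPY",
    "gzip": "GZIP",
    "zstd": "ZSTD",
    "lz4": "LZ4",
}


def compression_codec_stmt(table_format):
    """Return a SET statement for a Parquet table format, or None.

    Assumption: table_format looks like "file_format/codec", for example
    "parquet/zstd". Non-Parquet formats, and Parquet with no codec part,
    return None so the session default is left untouched.
    """
    parts = table_format.lower().split("/")
    file_format = parts[0]
    if file_format != "parquet" or len(parts) < 2:
        return None
    codec = parts[1]
    if codec not in CODEC_OPTION_VALUES:
        # Fail loudly instead of silently writing with the default codec.
        raise ValueError("Unsupported Parquet codec: {}".format(codec))
    return "SET COMPRESSION_CODEC={};".format(CODEC_OPTION_VALUES[codec])


if __name__ == "__main__":
    print(compression_codec_stmt("parquet/zstd"))
```

In this sketch the returned statement would be issued in the same session before the INSERT that populates the Parquet table, since query options apply per session.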

      Being able to load ZSTD (or other compression) parquet makes it easier to do performance comparisons for those compression codecs on the perf-AB-test upstream job (https://jenkins.impala.io/job/perf-AB-test/).

          People

            Assignee: Unassigned
            Reporter: Joe McDonnell (joemcdonnell)