IMPALA / IMPALA-3316

convert_legacy_hive_parquet_utc_timestamps=true makes reading parquet tables 30x slower

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3
    • Fix Version/s: Impala 3.1.0
    • Component/s: Backend
    • Labels: None
    • Environment:
      CDH 5.5.2 / Impala 2.3
      Parquet table with a timestamp column
      Secure cluster
      convert_legacy_hive_parquet_utc_timestamps=true
      Timestamp column is not being filtered on

    Description

      Enabling convert_legacy_hive_parquet_utc_timestamps=true
      makes simple queries that don't even filter on a timestamp attribute perform really poorly.

      Parquet table.
      Impala 2.3 / CDH 5.5.2.

      convert_legacy_hive_parquet_utc_timestamps=true makes the following simple query about 30x slower (1.1 minutes -> over 30 minutes).

      select * from parquet_table_with_a_timestamp_attribute where bigint_attribute=1000771658169

      Notice I did not even filter on a timestamp attribute.

      I ran multiple tests with and without the convert_legacy_hive_parquet_utc_timestamps=true impalad flag present.

      Also, from https://issues.cloudera.org/browse/IMPALA-1658:

      Casey Ching added a comment - 15/Jun/15 5:12 PM
      Btw, a perf test showed enabling this flag was 10x slower.
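
      For context, the flag asks Impala to treat timestamps in Parquet files written by Hive as UTC and to convert them to local time when they are read. The sketch below only illustrates that per-value conversion in Python; it is not Impala's implementation. The function name utc_to_local and the fixed UTC-7 offset are assumptions for illustration, and a real conversion would look up the zone's rules (including DST) for each value.

```python
from datetime import datetime, timedelta, timezone


def utc_to_local(naive_utc: datetime, utc_offset: timedelta) -> datetime:
    """Interpret a naive timestamp as UTC and return the naive local wall-clock time.

    utc_offset is a fixed offset (an assumption for this sketch); a real
    conversion would consult a time zone database for the given instant.
    """
    aware_utc = naive_utc.replace(tzinfo=timezone.utc)
    local = aware_utc.astimezone(timezone(utc_offset))
    return local.replace(tzinfo=None)


# Hypothetical stored value: 12:00 UTC, converted with a UTC-7 offset.
stored = datetime(2016, 1, 15, 12, 0, 0)
print(utc_to_local(stored, timedelta(hours=-7)))
```

      With the flag enabled, a conversion of this kind applies to each timestamp value that is read, which would be consistent with a query such as select * paying the cost even when it does not filter on the timestamp column.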

      Attachments

        1. screenshot-1.png
          9 kB
          Ruslan Dautkhanov
        2. screenshot-2.png
          19 kB
          Boris Tyukin

            People

              Assignee: Attila Jeges (attilaj)
              Reporter: Ruslan Dautkhanov (tagar_impala_e3b3)
              Votes: 0
              Watchers: 18
