[IMPALA-3316] convert_legacy_hive_parquet_utc_timestamps=true makes reading parquet tables 30x slower - ASF JIRA

XML

Word

Printable

JSON

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: impala 2.3
Fix Version/s: Impala 3.1.0
Component/s: Backend
Labels:
None
Environment:
CDH 5.5.2/ Impala 2.3
Parquet table with a timestamp column
Secure cluster
convert_legacy_hive_parquet_utc_timestamps=true
Timestamp column is not being filtered on

Enabling convert_legacy_hive_parquet_utc_timestamps=true
makes simple queries that don't even filter on a timestamp attribute perform really poorly.

Parquet table.
Impala 2.3 / CDH 5.5.2.

convert_legacy_hive_parquet_utc_timestamps=true makes following simple query 30x slower (1.1minutes -> over 30 minutes).

select * from parquet_table_with_a_timestamp_attribute where bigint_attribute=1000771658169

Notice I did not even filter on a timestamp attribute.

Made multiple tests with and without convert_legacy_hive_parquet_utc_timestamps=true impalad present.

Casey Ching added a comment - 15/Jun/15 5:12 PM
Btw, a perf test showed enabling this flag was 10x slower.

relates to

IMPALA-3307 add support for IANA time zone database

IMPALA-1773 Implement TIMESTAMP WITH TIME ZONE data type

IMPALA-2125 Improve perf when reading timestamps from parquet files written by hive

IMPALA-1658 Add a compatibility option for reading parquet timestamps written by Hive

links to

Google CCTZ c++ library

(1 links to)