[IMPALA-7568] Implement timezone aware parquet stat filtering for timestamp columns - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: Impala 3.2.0
Component/s: Backend
Labels:
- parquet
- timestamp

Description

Parquet timestamp columns can contain UTC-normalized data: the values are stored in UTC but are expected to be shown in local time (to be consistent with Hive). This is done by converting these timestamps from UTC to local time during scanning.

This conversion has to be taken into account during min/max stat filtering; otherwise some row groups can be incorrectly skipped. For this reason, ~~IMPALA-7559~~ disables stat filtering on UTC-normalized timestamp columns.

This ticket deals with creating a correct implementation so that stat filtering can be re-enabled for these columns.

DST and historical rule changes add some complexity to this. The UTC->local mapping can be non-monotonic, and the local->UTC mapping can be ambiguous. Non-monotonicity means that tMin <= t <= tMax being true in UTC does not imply that the same inequality holds after the three values are converted to local time.
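
To illustrate the non-monotonic case, here is a minimal Python sketch. It does not use real timezone data: `FALL_BACK_UTC` and `utc_to_local` are hypothetical names modelling an assumed zone that is at UTC+2 until a single fall-back instant and at UTC+1 afterwards.

```python
from datetime import datetime, timedelta

# Assumption: a toy zone with one DST fall-back. Naive datetimes stand for
# UTC instants on input and for local wall-clock times on output.
FALL_BACK_UTC = datetime(2018, 10, 28, 1, 0)

def utc_to_local(t_utc):
    """Convert a UTC instant to local wall-clock time in the toy zone."""
    if t_utc < FALL_BACK_UTC:
        return t_utc + timedelta(hours=2)  # summer time
    return t_utc + timedelta(hours=1)      # winter time

# Three UTC instants that straddle the fall-back.
t_min = datetime(2018, 10, 28, 0, 30)
t = datetime(2018, 10, 28, 0, 50)
t_max = datetime(2018, 10, 28, 1, 10)

in_utc_range = t_min <= t <= t_max

local_min = utc_to_local(t_min)  # 02:30 local
local_t = utc_to_local(t)        # 02:50 local
local_max = utc_to_local(t_max)  # 02:10 local, earlier than local_min
in_local_range = local_min <= local_t <= local_max

print(in_utc_range, in_local_range)
```

In this toy zone the ordering that holds in UTC is lost after conversion, because the converted maximum lands before the converted minimum.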

The solution I see is to convert the min/max of the predicate from local time to UTC, resolving ambiguity by choosing the earlier time for min and the later time for max. These UTC values can then be compared with the stats safely.
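
The proposed approach could be sketched roughly as follows. This is not Impala's actual implementation: the toy zone (UTC+2 before a single fall-back instant, UTC+1 after) and the names `local_to_utc_candidates` and `can_skip_row_group` are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

# Assumption: a toy zone with one DST fall-back and no gaps.
FALL_BACK_UTC = datetime(2018, 10, 28, 1, 0)
OFFSETS = (timedelta(hours=2), timedelta(hours=1))

def utc_to_local(t_utc):
    return t_utc + (OFFSETS[0] if t_utc < FALL_BACK_UTC else OFFSETS[1])

def local_to_utc_candidates(t_local):
    """Return every UTC instant that maps to the given local time."""
    candidates = [t_local - off for off in OFFSETS]
    return sorted(u for u in candidates if utc_to_local(u) == t_local)

def can_skip_row_group(stat_min_utc, stat_max_utc, pred_min_local, pred_max_local):
    """True if no row in the row group can satisfy the local-time predicate."""
    lo_utc = local_to_utc_candidates(pred_min_local)[0]   # earlier time for min
    hi_utc = local_to_utc_candidates(pred_max_local)[-1]  # later time for max
    return stat_max_utc < lo_utc or stat_min_utc > hi_utc

# Predicate in local time: 02:00 <= ts <= 02:10, which is ambiguous here.
pred_min = datetime(2018, 10, 28, 2, 0)
pred_max = datetime(2018, 10, 28, 2, 10)

# Row group whose UTC stats fall after the fall-back (local 02:05 to 02:20).
overlapping = can_skip_row_group(
    datetime(2018, 10, 28, 1, 5), datetime(2018, 10, 28, 1, 20), pred_min, pred_max)

# Row group whose UTC stats lie well after the predicate range.
disjoint = can_skip_row_group(
    datetime(2018, 10, 28, 3, 0), datetime(2018, 10, 28, 4, 0), pred_min, pred_max)

print(overlapping, disjoint)
```

Widening the predicate to the earliest possible minimum and the latest possible maximum is conservative: it may keep a row group that contains no matching rows, but in this model it never skips one that does.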

Note that the timezone rules can differ between Hive and Impala (especially the historical ones), so we cannot ensure that Impala gives exactly the same results as Hive. The goal is to ensure that Impala returns the same rows with and without stat filtering.

Issue Links

is duplicated by

IMPALA-7567 Implement timezone aware parquet stat filtering for timestamp columns

Closed

is part of

IMPALA-5050 Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS to the parquet scanner

Resolved

is related to

IMPALA-7559 Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps

Resolved

IMPALA-5050 Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS to the parquet scanner

Resolved

People

Assignee: Csaba Ringhofer

Reporter: Csaba Ringhofer

Votes: 0

Watchers: 1

Dates

Created: 13/Sep/18 12:57

Updated: 20/Nov/18 15:14

Resolved: 20/Nov/18 15:14