Parquet timestamp columns can contain UTC-normalized data: the values are stored in UTC but are expected to be shown in local time (to be consistent with Hive). This is done by converting these timestamps from UTC to local time during scanning.
This conversion has to be taken into account during min/max stat filtering, otherwise some row groups can be incorrectly skipped. For this reason IMPALA-7559 disables stat filtering on UTC-normalized timestamp columns.
This ticket deals with creating a correct implementation so that stat filtering can be re-enabled for these columns.
DST and historical rule changes add some complexity to this. The UTC->local mapping can be non-monotonic, and the local->UTC mapping can be ambiguous. Non-monotonicity means that tMin <= t <= tMax being true in UTC does not imply that the same holds after converting all three values to local time.
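The ordering problem described above can be reproduced with a small sketch. It does not use real timezone data: it assumes a hypothetical zone with a single hard-coded transition, modeled on the EU fall-back of 2018-10-28 (UTC+2 before 01:00 UTC, UTC+1 after). The names utc_to_local, TRANSITION_UTC, OFFSET_BEFORE and OFFSET_AFTER are illustrative and are not Impala identifiers.

```python
from datetime import datetime, timedelta

# Hypothetical zone with one transition (assumption for illustration):
# UTC+2 before 2018-10-28 01:00 UTC, UTC+1 from that instant on.
TRANSITION_UTC = datetime(2018, 10, 28, 1, 0)
OFFSET_BEFORE = timedelta(hours=2)
OFFSET_AFTER = timedelta(hours=1)


def utc_to_local(t_utc):
    """Convert a naive UTC datetime to naive local time in the toy zone."""
    offset = OFFSET_BEFORE if t_utc < TRANSITION_UTC else OFFSET_AFTER
    return t_utc + offset


# Three UTC instants around the transition, ordered tMin <= t <= tMax.
t_min = datetime(2018, 10, 28, 0, 30)
t = datetime(2018, 10, 28, 0, 50)
t_max = datetime(2018, 10, 28, 1, 10)

local_min, local_t, local_max = (utc_to_local(x) for x in (t_min, t, t_max))

# The ordering holds in UTC ...
in_utc_range = t_min <= t <= t_max
# ... but not in local time, because t_max falls after the clock was set back.
in_local_range = local_min <= local_t <= local_max

print(local_min.time(), local_t.time(), local_max.time())
print(in_utc_range, in_local_range)
```

Here the row group's max stat converts to an earlier local time than its min stat, so converting the stats to local time and comparing them with the predicate is not safe.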
The solution I see is to convert the min/max of the predicate from local time to UTC, resolving ambiguity by choosing the earlier time for min and the later time for max. These UTC values can then be compared with the stats safely.
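A minimal sketch of this idea follows. It is not the Impala implementation. It assumes a hypothetical zone with a single hard-coded fall-back transition (UTC+2 before 2018-10-28 01:00 UTC, UTC+1 after), so every local time has at least one UTC candidate; local times skipped by a spring-forward transition are not covered. All function and constant names are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical zone with one fall-back transition (assumption for illustration).
TRANSITION_UTC = datetime(2018, 10, 28, 1, 0)
OFFSET_BEFORE = timedelta(hours=2)
OFFSET_AFTER = timedelta(hours=1)


def utc_to_local(t_utc):
    """Convert a naive UTC datetime to naive local time in the toy zone."""
    offset = OFFSET_BEFORE if t_utc < TRANSITION_UTC else OFFSET_AFTER
    return t_utc + offset


def local_to_utc_candidates(t_local):
    """Return every UTC instant that maps to t_local (two if ambiguous)."""
    candidates = (t_local - OFFSET_BEFORE, t_local - OFFSET_AFTER)
    return [c for c in candidates if utc_to_local(c) == t_local]


def predicate_range_to_utc(pred_min_local, pred_max_local):
    """Widen the local predicate range to a UTC range.

    Ambiguity is resolved with the earlier instant for the lower bound
    and the later instant for the upper bound.
    """
    utc_min = min(local_to_utc_candidates(pred_min_local))
    utc_max = max(local_to_utc_candidates(pred_max_local))
    return utc_min, utc_max


def row_group_may_match(stat_min_utc, stat_max_utc, pred_min_local, pred_max_local):
    """Return False only if the row group can be skipped safely."""
    utc_min, utc_max = predicate_range_to_utc(pred_min_local, pred_max_local)
    return stat_max_utc >= utc_min and stat_min_utc <= utc_max


# Predicate: local time between 02:20 and 02:40, both ambiguous in this zone.
pred_min = datetime(2018, 10, 28, 2, 20)
pred_max = datetime(2018, 10, 28, 2, 40)

# Row group written after the transition: local times 02:25 to 02:35.
keep = row_group_may_match(
    datetime(2018, 10, 28, 1, 25), datetime(2018, 10, 28, 1, 35), pred_min, pred_max)

# Row group well after the transition: local times 03:00 to 04:00.
skip = not row_group_may_match(
    datetime(2018, 10, 28, 2, 0), datetime(2018, 10, 28, 3, 0), pred_min, pred_max)

print(predicate_range_to_utc(pred_min, pred_max), keep, skip)
```

Picking the earlier instant for both bounds would give the UTC range 00:20 to 00:40 and wrongly skip the first row group, which is why the two bounds are resolved in opposite directions.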
Note that the timezone rules can differ between Hive and Impala (especially historical ones), so we cannot ensure that Impala gives exactly the same results as Hive. The goal is to ensure that Impala returns the same rows with and without stat filtering.