Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.0
None
None
Description
As pointed out in https://gerrit.cloudera.org/#/c/11731 by csringhofer, our support for the ORC file format doesn't follow the same timezone conventions as the rest of Impala.
tl;dr: ORC's timezone handling is likely broken in Impala, so we should patch it in the toolchain.
The ORC library implements its own IANA timezone handling to convert stored timestamps from UTC to local time, and it applies a similar conversion to min/max stats. The writer's timezone can also be stored in .orc files and used instead of the local timezone.
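To make the risk concrete, here is a minimal Python sketch (not Impala or ORC code) of what happens when two components convert the same stored UTC value using different zones. The fixed offsets and the names `IMPALA_ZONE`, `ORC_ZONE` and `to_local_wall_clock` are hypothetical stand-ins chosen for illustration; real IANA zones also carry DST rules, which this sketch ignores.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets standing in for two different IANA zones.
IMPALA_ZONE = timezone(timedelta(hours=1))   # assumed: zone resolved by Impala
ORC_ZONE = timezone(timedelta(hours=-7))     # assumed: zone resolved by the ORC library


def to_local_wall_clock(utc_value: datetime, zone: timezone) -> datetime:
    """Convert a UTC instant to a naive wall-clock value in `zone`."""
    return utc_value.astimezone(zone).replace(tzinfo=None)


# The same stored UTC timestamp, rendered under each zone.
stored = datetime(2018, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
impala_view = to_local_wall_clock(stored, IMPALA_ZONE)
orc_view = to_local_wall_clock(stored, ORC_ZONE)

print(impala_view)  # 2018-10-01 13:00:00
print(orc_view)     # 2018-10-01 05:00:00
```

If the two components disagree on the zone, the same file yields wall-clock values that differ by the offset gap, and any min/max stats converted the same way would be shifted by the same amount.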
Impala's timezone and the ORC library's timezone can differ for several reasons:
ORC's timezone is not overridden by the TZ environment variable or by the timezone query option
ORC uses a simpler way to detect the local timezone, which may not work on some Linux distros (see TimezoneDatabase::LocalZoneName in Impala vs LOCAL_TIMEZONE in ORC)
.orc files can use any timezone as the writer's timezone, and we cannot be sure that it will exist on the reader machine
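The reasons above can be sketched as two zone-resolution functions. This is an illustrative Python model only: the function names, the precedence order between the query option and TZ, and the choice to raise an error for a missing writer zone are all assumptions, not the actual behavior of Impala or the ORC library.

```python
from typing import Optional, Set


def impala_style_zone(query_option: Optional[str],
                      env_tz: Optional[str],
                      system_zone: str) -> str:
    """Assumed precedence: query option, then TZ env var, then system zone."""
    return query_option or env_tz or system_zone


def orc_style_zone(writer_zone: Optional[str],
                   system_zone: str,
                   known_zones: Set[str]) -> str:
    """Ignores query option and TZ; prefers the writer zone stored in the file."""
    zone = writer_zone or system_zone
    if zone not in known_zones:
        # Assumption for illustration: a zone missing on the reader is an error.
        raise LookupError(f"time zone {zone!r} is not available on this reader")
    return zone


# Hypothetical reader machine with a small zone database.
known = {"UTC", "Europe/Budapest", "America/Los_Angeles"}

impala_zone = impala_style_zone("UTC", "Europe/Budapest", "America/Los_Angeles")
orc_zone = orc_style_zone(None, "America/Los_Angeles", known)

print(impala_zone)  # UTC
print(orc_zone)     # America/Los_Angeles
```

With the same machine and the same query, the two functions return different zones, and a file written with a zone outside `known` cannot be resolved at all.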
My suggestion is to patch the ORC library in the toolchain and remove its timezone handling (e.g. by always using UTC, maybe depending on a flag), as the current behavior is likely to be broken and is certainly not consistent with the rest of Impala.
I am not sure how timezones could be handled correctly in ORC + Impala. If someone plans to work on it, I would gladly help with the integration into Impala.