Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
Repro (thanks Qiaosong Dong) -
Add -Duser.timezone=GMT+8 to tez.task.launch.cmd-opts
create external table test_dt(id string, dt date); insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07'); select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id; +------+ | _c0 | +------+ | 6 | | 7 | +------+
Expected output -
+------+ | _c0 | +------+ | 5 | | 6 | +------+
Cause
This happens because in VectorUDFDateDiffColScalar class
1. For second argument(scalar) , we use java.text.SimpleDateFormat to parse the date strings which interprets it to be in local timezone.
2. For first column we get a column vector which represents the date as epoch day. This is always in UTC.
Solution
We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.
I did a quick change in VectorUDFDateDiffColScalar which fixes the issue.
- date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime()); - baseDate = DateWritableV2.dateToDays(date); + org.apache.hadoop.hive.common.type.Date hiveDate + = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8")); + date.setTime(hiveDate.toEpochMilli()); + baseDate = hiveDate.toEpochDay();
Attachments
Issue Links
- relates to
-
HIVE-12192 Hive should carry out timestamp computations in UTC
- Closed
- links to