Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
2.3.4, 3.0.1
-
None
-
None
-
Hortonworks (HDP3)
- Hadoop 3.1.1.3.0.1.0-187
- Hive 3.1.0
- Tez 0.9.1
EMR
- Hadoop 2.8.5-amzn-1
- Hive 2.3.4
- Tez 0.9.1
Description
Hive table query with a UDF call, shows multiple calls to the UDF. The UDF is being called to as many times as the number of fields in sub-select statement.
This behavior leads to performance issues as well as it also gives incorrect results while collecting the duration or timings of any operation.
To show the problem TimestampUDF (sample code has been attached) has been created which returns three fields namely start_time (string), end_time (string) and duration_milli (long). To keep the things simple, fixed 2 second delay has been added. In the TimestampUDF duration_mill is calculated as (end_time - start_time).
The expectation is that duration_milli should be equal to (end_time - start_time)
I have tested following Hive Queries:
- Select T.Candidate.startTime, T.Candidate.endTime, T.Candidate.duration_milli from (Select TimestampUDF() as Candidate) T;
- Select T.Candidate.startTime, T.Candidate.endTime, T.Candidate.duration_milli from (Select TimestampUDF() as Candidate from idtable) T;
Note
"idtable" is having only one entry and has been created via followig hive queries
create table if not exists idTable (RowNum int);
insert overwrite table idTable values (1);
In Hive query (1)
I am getting correct values because TimestampUDF is being called once and if we calculate the (end_time - start_time) manually and compare it with returned duration_milli then they both are same.
However, in Hive query (2)
duration_milli is not equal to (end_time - start_time) because the TimestampUDF is being called thrice (the number of fields in sub-select query) and also
- start_time is from 1st call of TimestampUDF
- end_time is from 2nd call of TimestampUDF
- duration_milli is from 3rd call of TimestampUDF
I got following results from hive query (2), same values can be validated in attached logs file of respective platform
EMR
---------------------------------------------------++----------------
+
starttime | endtime | duration_milli |
--------------------------+++------------------------------------------
2019-04-16T11:53:13.595Z | 2019-04-16T11:53:17.619Z | 2000 |
---------------------------------------------------++----------------
HDP3
---------------------------------------------------++----------------
starttime | endtime | duration_milli |
---------------------------------------------------++----------------
2019-04-16T12:37:33.593Z | 2019-04-16T12:37:37.597Z | 2000 |
---------------------------------------------------++----------------