Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21618

Hive table query with a UDF call, shows multiple calls to the UDF.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.3.4, 3.0.1
    • None
    • Tez
    • None
    • Hortonworks (HDP3)

      • Hadoop 3.1.1.3.0.1.0-187
      • Hive 3.1.0
      • Tez 0.9.1

       

      EMR

      • Hadoop 2.8.5-amzn-1
      • Hive 2.3.4
      • Tez 0.9.1

    Description

      Hive table query with a UDF call, shows multiple calls to the UDF. The UDF is being called to as many times as the number of fields in sub-select statement.

       

      This behavior leads to performance issues as well as it also gives incorrect results while collecting the duration or timings of any operation.

      To show the problem TimestampUDF (sample code has been attached) has been created which returns three fields namely start_time (string), end_time (string) and duration_milli (long). To keep the things simple, fixed 2 second delay has been added. In the TimestampUDF duration_mill is calculated as (end_time - start_time).

      The expectation is that duration_milli should be equal to (end_time - start_time)

       

      I have tested following Hive Queries:

      1. Select T.Candidate.startTime, T.Candidate.endTime, T.Candidate.duration_milli from (Select TimestampUDF() as Candidate) T;
      2. Select T.Candidate.startTime, T.Candidate.endTime, T.Candidate.duration_milli from (Select TimestampUDF() as Candidate from idtable) T;

       

      Note

      "idtable" is having only one entry and has been created via followig hive queries

      create table if not exists idTable (RowNum int);
      insert overwrite table idTable values (1);

       

      In Hive query (1)                              

      I am getting correct values because TimestampUDF is being called once and if  we calculate the  (end_time - start_time) manually and compare it with returned duration_milli then they both are same.

       

      However, in Hive query (2)          

      duration_milli is not equal to (end_time - start_time) because the TimestampUDF is being called thrice (the number of fields in sub-select query) and also

      • start_time is from 1st   call of TimestampUDF
      • end_time is from 2nd call of TimestampUDF
      • duration_milli is from 3rd call of TimestampUDF

       

      I got following results  from hive query (2), same values can be validated in attached logs file of respective platform

       

      EMR 

      ---------------------------------------------------++----------------

      +

      starttime endtime duration_milli

      --------------------------+++------------------------------------------

      2019-04-16T11:53:13.595Z 2019-04-16T11:53:17.619Z 2000

      ---------------------------------------------------++----------------

       

      HDP3

      ---------------------------------------------------++----------------

      starttime endtime duration_milli

      ---------------------------------------------------++----------------

      2019-04-16T12:37:33.593Z 2019-04-16T12:37:37.597Z 2000

      ---------------------------------------------------++----------------

       

       

      Attachments

        1. TimestampUDF.java
          3 kB
          Rajveer Singh
        2. hdp_3_cluster_timestampUDF.log
          213 kB
          Rajveer Singh
        3. emr_19_cluster_timestampUDF.log
          232 kB
          Rajveer Singh

        Activity

          People

            Unassigned Unassigned
            rajveerscse Rajveer Singh
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: