Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-12412

Multi insert queries fail to run properly in hive 1.1.x or later.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.1.0, 1.2.0, 2.3.0
    • None
    • None

    Description

      We use multi insert queries to take data in one table and manipulate it by inserting it into a results table. Queries are of this form:

      from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table
      insert overwrite table results_table partition (q_id='C.P1',rl='1')
      select
      array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as dimensions,
      ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics,
      null as rownm
      where (explode_data_table.date_id between 20151016 and 20151016)
      group by
      if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1),
      explode_f2.s1
      INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0')
      SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS dimensions,
      ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
      null AS rownm
      WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
      INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
      SELECT
      ARRAY(CAST('Total' as String)) AS dimensions,
      ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
      null AS rownm
      WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)

      This query is meant to total a given field of a struct that is potentially a list of structs. For our test data set, which consists of a single row, the summation yields "Null", with messages in the hive log of the nature:

      Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
      or "Extra fields detected..."

      For significantly more data, this query will eventually cause a run time error while processing a column (caused by array index out of bounds exception in one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).

      Using the query above from the hive command line, the following data was used:
      (note there are tabs in the data below)

      string one one:1.0:1.00:10.0,eon:1.0:1.00:100.0
      string two two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
      string thr three:3.0:3.00:30.0
      string fou four:4.0:4.00:40.0

      There are two fields, a string, (eg. 'string one') and a list of structs. The following is used to create the table:

      create table if not exists t1 (
      f1 string,
      f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
      )
      partitioned by (clid string, date_id string)
      row format delimited fields
      terminated by '09'
      collection items terminated by ','
      map keys terminated by ':'
      lines terminated by '10'
      location '/user/hive/warehouse/t1';

      And the following is used to load the data:

      load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE into table t1 partition(client_id='987654321',date_id='20151016');

      The resulting table should yield the following:
      ["string fou","four"] [4.0] null C.P1 1
      ["string one","eon"] [1.0] null C.P1 1
      ["string one","one"] [1.0] null C.P1 1
      ["string thr","three"] [3.0] null C.P1 1
      ["string two","otw"] [2.0] null C.P1 1
      ["string two","two"] [2.0] null C.P1 1
      ["string two","wott"] [2.0] null C.P1 1
      ["Total","Total"] [15.0] null C.P2 0
      ["Total"] [15.0] null C.P5 0

      However what we get is:
      Hive Runtime Error while processing row

      {"_col2":2.5306499719322744E-258,"_col3":""}

      (ultimately due to an array index out of bounds exception)

      If we reduce the above data to a SINGLE row, the we don't get an exception but the total fields come out as NULL.

      The ONLY way this query would work is
      1) if I added a group by (date_id) or even group by ('') as the last line in the query... or removed the last where clause for the final insert. (The reason why we have a where clause on the initial select is due to push down predicates not working... and causing a complete scan of the entire table).

      or
      2) removing the group by clauses entirely

      or using
      3) hive.multigroupby.singlereducer = false

      Once 3) is used no modification of the query is needed... it all runs. This is very similar in nature to the following:
      https://issues.apache.org/jira/browse/HIVE-2750
      which is supposedly fixed and closed.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              jpetrakis John P. Petrakis
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: