[HIVE-12412] Multi insert queries fail to run properly in hive 1.1.x or later. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.1.0, 1.2.0, 2.3.0
Fix Version/s: None
Component/s: None
Labels:
- Correctness
- CorrectnessBug

Description

We use multi insert queries to take data in one table and manipulate it by inserting it into a results table. Queries are of this form:

from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table
insert overwrite table results_table partition (q_id='C.P1',rl='1')
select
array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as dimensions,
ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics,
null as rownm
where (explode_data_table.date_id between 20151016 and 20151016)
group by
if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1),
explode_f2.s1
INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0')
SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS dimensions,
ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
null AS rownm
WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
SELECT
ARRAY(CAST('Total' as String)) AS dimensions,
ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
null AS rownm
WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)

This query is meant to total a given field of a struct that is potentially a list of structs. For our test data set, which consists of a single row, the summation yields "Null", with messages in the hive log of the nature:

Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
or "Extra fields detected..."

For significantly more data, this query will eventually cause a run time error while processing a column (caused by array index out of bounds exception in one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).

Using the query above from the hive command line, the following data was used:
(note there are tabs in the data below)

string one one:1.0:1.00:10.0,eon:1.0:1.00:100.0
string two two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
string thr three:3.0:3.00:30.0
string fou four:4.0:4.00:40.0

There are two fields, a string, (eg. 'string one') and a list of structs. The following is used to create the table:

create table if not exists t1 (
f1 string,
f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
)
partitioned by (clid string, date_id string)
row format delimited fields
terminated by '09'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '10'
location '/user/hive/warehouse/t1';

And the following is used to load the data:

load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE into table t1 partition(client_id='987654321',date_id='20151016');

The resulting table should yield the following:
["string fou","four"] [4.0] null C.P1 1
["string one","eon"] [1.0] null C.P1 1
["string one","one"] [1.0] null C.P1 1
["string thr","three"] [3.0] null C.P1 1
["string two","otw"] [2.0] null C.P1 1
["string two","two"] [2.0] null C.P1 1
["string two","wott"] [2.0] null C.P1 1
["Total","Total"] [15.0] null C.P2 0
["Total"] [15.0] null C.P5 0

However what we get is:
Hive Runtime Error while processing row

{"_col2":2.5306499719322744E-258,"_col3":""}

(ultimately due to an array index out of bounds exception)

If we reduce the above data to a SINGLE row, the we don't get an exception but the total fields come out as NULL.

The ONLY way this query would work is
1) if I added a group by (date_id) or even group by ('') as the last line in the query... or removed the last where clause for the final insert. (The reason why we have a where clause on the initial select is due to push down predicates not working... and causing a complete scan of the entire table).

or
2) removing the group by clauses entirely

or using
3) hive.multigroupby.singlereducer = false

Once 3) is used no modification of the query is needed... it all runs. This is very similar in nature to the following:
https://issues.apache.org/jira/browse/HIVE-2750
which is supposedly fixed and closed.

Attachments

Issue Links

Blocked

HIVE-2750 Hive multi group by single reducer optimization causes invalid column reference error

Closed

Multi insert queries fail to run properly in hive 1.1.x or later.

Details

Description

Attachments

Issue Links

Activity

People

Dates