[HIVE-6099] Multi insert does not work properly with distinct count - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 1.0.0
Fix Version/s: 1.2.0
Component/s: Query Processor
Labels:
- TODOC1.2
- count
- distinct
- insert
- multi-insert

Description

Two rows are enough to reproduce this bug. Here are the steps.

Step 1) Create a table Table_A
CREATE EXTERNAL TABLE Table_A
(
user string
, type int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/Table_A';

Step 2) Scenario: suppose user tommy belongs to both user types, 111 and 123. Insert 2 records into the table created above.

select * from Table_A;

hive> select * from table_a;
OK
tommy 123 2013-12-02
tommy 111 2013-12-02

Step 3) Create 2 destination tables to simulate multi-insert.
CREATE EXTERNAL TABLE dest_Table_A
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_A';

CREATE EXTERNAL TABLE dest_Table_B
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_B';

Step 4) Multi insert statement
from Table_A a
INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt

INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt
;

Step 5) Verify results.
hive> select * from dest_table_a;
OK
2013-12-02 2 1 1 2013-12-02
Time taken: 0.116 seconds
hive> select * from dest_table_b;
OK
2013-12-02 2 1 1 2013-12-02
Time taken: 0.13 seconds

Conclusion: Hive gives a count of 2 for distinct users although there is only one distinct user. After trying many datasets, I observed that Hive computes DistinctUsers as Type111Users + Type123Users, which is wrong.

hive> select count(distinct a.user) from table_a a;
Gives:
Total MapReduce CPU Time Spent: 4 seconds 350 msec
OK
1
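
As a cross-check of the expected result, the SELECT from Step 4 can be run against the same two rows in another SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption for illustration only: it is not Hive and does not exercise the multi-insert code path), and the helper name expected_counts is hypothetical.

```python
import sqlite3


def expected_counts(rows):
    """Run the Step 4 SELECT over (user, type, dt) rows in an in-memory SQLite DB.

    SQLite is only a reference engine here; it shows what the distinct counts
    should be, not how Hive computes them.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_a (user TEXT, type INTEGER, dt TEXT)")
    conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT a.dt,
               COUNT(DISTINCT a.user) AS AllDist,
               COUNT(DISTINCT CASE WHEN a.type = 111 THEN a.user ELSE NULL END) AS Type111User,
               COUNT(DISTINCT CASE WHEN a.type != 111 THEN a.user ELSE NULL END) AS Type123User
        FROM table_a a
        GROUP BY a.dt
        """
    ).fetchall()
    conn.close()
    return result


# The two rows from Step 2: the same user appears under both types.
rows = [("tommy", 123, "2013-12-02"), ("tommy", 111, "2013-12-02")]
print(expected_counts(rows))  # [('2013-12-02', 1, 1, 1)]
```

With these rows the reference result is 1 1 1 for the partition, whereas both destination tables in Step 5 contain 2 1 1.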

Attachments

- with_enabled.txt (9 kB, 22/Jan/14 01:17, Navis Ryu)
- with_disabled.txt (6 kB, 22/Jan/14 01:17, Navis Ryu)
- HIVE-6099.patch (7 kB, 03/Feb/15 18:14, Ashutosh Chauhan)
- HIVE-6099.4.patch (244 kB, 05/Feb/15 06:05, Ashutosh Chauhan)
- HIVE-6099.3.patch (242 kB, 04/Feb/15 20:05, Ashutosh Chauhan)
- HIVE-6099.2.patch (242 kB, 04/Feb/15 02:50, Ashutosh Chauhan)
- HIVE-6099.1.patch (229 kB, 04/Feb/15 02:03, Ashutosh Chauhan)
- explain_hive_0.10.0.txt (9 kB, 16/Jan/14 06:59, Pavan Gadam Manohar)

Issue Links

relates to: HIVE-3728 make optimizing multi-group by configurable (Closed)
links to: RB request

People

Assignee: Ashutosh Chauhan

Reporter: Pavan Gadam Manohar

Votes: 0

Watchers: 4

Dates

Created: 23/Dec/13 19:46

Updated: 31/Aug/15 18:20

Resolved: 06/Feb/15 01:49