Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 1.0.0
Description
Need 2 rows to reproduce this Bug. Here are the steps.
Step 1) Create a table Table_A
CREATE EXTERNAL TABLE Table_A
(
user string
, type int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/Table_A';
Step 2) Scenario: Lets us say consider user tommy belong to both usertypes 111 and 123. Insert 2 records into the table created above.
select * from Table_A;
hive> select * from table_a;
OK
tommy 123 2013-12-02
tommy 111 2013-12-02
Step 3) Create 2 destination tables to simulate multi-insert.
CREATE EXTERNAL TABLE dest_Table_A
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_A';
CREATE EXTERNAL TABLE dest_Table_B
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_B';
Step 4) Multi insert statement
from Table_A a
INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt
INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt
;
Step 5) Verify results.
hive> select * from dest_table_a;
OK
2013-12-02 2 1 1 2013-12-02
Time taken: 0.116 seconds
hive> select * from dest_table_b;
OK
2013-12-02 2 1 1 2013-12-02
Time taken: 0.13 seconds
Conclusion: Hive gives a count of 2 for distinct users although there is
only one distinct user. After trying many datasets observed that Hive is doing Type111Users + Typoe123Users = DistinctUsers which is wrong.
hive> select count(distinct a.user) from table_a a;
Gives:
Total MapReduce CPU Time Spent: 4 seconds 350 msec
OK
1
Attachments
Attachments
Issue Links
- relates to
-
HIVE-3728 make optimizing multi-group by configurable
- Closed
- links to