[HIVE-17499] Hive Cube Operator returns duplicate rows - ASF JIRA

Log work

Agile Board

Rank to Top

Rank to Bottom

Bulk Copy Attachments

Bulk Move Attachments

Add vote

Voters

Watch issue

Watchers

Create sub-task

Convert to sub-task

Move

Link

Clone

Labels

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: None
Labels:
None
Environment:

Hortonworks HDP 2.6.0.1

Description

The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below)

insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01')
select year(ds), month(ds), day(ds), c8, c11, count(*)
from thesis.clickstream_landing
where ds = '2016-04-01'
group by year(ds), month(ds), day(ds), c8, c11
With Cube;

Then I check for duplicates:

select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '01'
group by year, month, day, country, city
having count(*) > 1;

The result is:
year month day country city _c5
null null null null null 4
null null 1 null null 4
null 4 null null null 4
null 4 1 null null 4
2016 null null null null 4
2016 null 1 null null 4
2016 4 null null null 4
2016 4 1 null null 4

When i do the same thing in Pig, everything is fine:

DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader();

FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city;

CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city);

D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count;

STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02');

Then again I check for duplicates:

select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '02'
group by year, month, day, country, city
having count(*) > 1;