Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-17499

Hive Cube Operator returns duplicate rows

Log workAgile BoardRank to TopRank to BottomBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 2.1.0
    • None
    • None
    • None
    • Hortonworks HDP 2.6.0.1

    Description

      The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below)

      insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01')
      select year(ds), month(ds), day(ds), c8, c11, count(*)
      from thesis.clickstream_landing
      where ds = '2016-04-01'
      group by year(ds), month(ds), day(ds), c8, c11
      With Cube;
      

      Then I check for duplicates:

      select year, month, day, country, city, count (*) from thesis.clickstream_export
      where ds_year = '2016' and ds_month = '04' and ds_day = '01'
      group by year, month, day, country, city
      having count(*) > 1;
      

      The result is:
      year month day country city _c5
      null null null null null 4
      null null 1 null null 4
      null 4 null null null 4
      null 4 1 null null 4
      2016 null null null null 4
      2016 null 1 null null 4
      2016 4 null null null 4
      2016 4 1 null null 4

      When i do the same thing in Pig, everything is fine:

      DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader();
      
      FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city;
      
      CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city);
      
      D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count;
      
      STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02');
      

      Then again I check for duplicates:

      select year, month, day, country, city, count (*) from thesis.clickstream_export
      where ds_year = '2016' and ds_month = '04' and ds_day = '02'
      group by year, month, day, country, city
      having count(*) > 1;
      

      And the result is empty as it should be.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned Assign to me
            joha0123 Johannes Mayer

            Dates

              Created:
              Updated:

              Slack

                Issue deployment