Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-6099

Multi insert does not work properly with distinct count

    XMLWordPrintableJSON

Details

    Description

      Need 2 rows to reproduce this Bug. Here are the steps.

      Step 1) Create a table Table_A
      CREATE EXTERNAL TABLE Table_A
      (
      user string
      , type int
      )
      PARTITIONED BY (dt string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '|'
      STORED AS RCFILE
      LOCATION '/hive/<path>/Table_A';

      Step 2) Scenario: Lets us say consider user tommy belong to both usertypes 111 and 123. Insert 2 records into the table created above.

      select * from Table_A;

      hive> select * from table_a;
      OK
      tommy 123 2013-12-02
      tommy 111 2013-12-02

      Step 3) Create 2 destination tables to simulate multi-insert.
      CREATE EXTERNAL TABLE dest_Table_A
      (
      p_date string
      , Distinct_Users int
      , Type111Users int
      , Type123Users int
      )
      PARTITIONED BY (dt string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '|'
      STORED AS RCFILE
      LOCATION '/hive/<path>/dest_Table_A';

      CREATE EXTERNAL TABLE dest_Table_B
      (
      p_date string
      , Distinct_Users int
      , Type111Users int
      , Type123Users int
      )
      PARTITIONED BY (dt string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '|'
      STORED AS RCFILE
      LOCATION '/hive/<path>/dest_Table_B';

      Step 4) Multi insert statement
      from Table_A a
      INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
      select a.dt
      ,count(distinct a.user) as AllDist
      ,count(distinct case when a.type = 111 then a.user else null end) as Type111User
      ,count(distinct case when a.type != 111 then a.user else null end) as Type123User
      group by a.dt

      INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
      select a.dt
      ,count(distinct a.user) as AllDist
      ,count(distinct case when a.type = 111 then a.user else null end) as Type111User
      ,count(distinct case when a.type != 111 then a.user else null end) as Type123User
      group by a.dt
      ;

      Step 5) Verify results.
      hive> select * from dest_table_a;
      OK
      2013-12-02 2 1 1 2013-12-02
      Time taken: 0.116 seconds
      hive> select * from dest_table_b;
      OK
      2013-12-02 2 1 1 2013-12-02
      Time taken: 0.13 seconds

      Conclusion: Hive gives a count of 2 for distinct users although there is
      only one distinct user. After trying many datasets observed that Hive is doing Type111Users + Typoe123Users = DistinctUsers which is wrong.

      hive> select count(distinct a.user) from table_a a;
      Gives:
      Total MapReduce CPU Time Spent: 4 seconds 350 msec
      OK
      1

      Attachments

        1. with_enabled.txt
          9 kB
          Navis Ryu
        2. with_disabled.txt
          6 kB
          Navis Ryu
        3. HIVE-6099.patch
          7 kB
          Ashutosh Chauhan
        4. HIVE-6099.4.patch
          244 kB
          Ashutosh Chauhan
        5. HIVE-6099.3.patch
          242 kB
          Ashutosh Chauhan
        6. HIVE-6099.2.patch
          242 kB
          Ashutosh Chauhan
        7. HIVE-6099.1.patch
          229 kB
          Ashutosh Chauhan
        8. explain_hive_0.10.0.txt
          9 kB
          Pavan Gadam Manohar

        Issue Links

          Activity

            People

              ashutoshc Ashutosh Chauhan
              pavangm Pavan Gadam Manohar
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: