Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-16965

SMB join may produce incorrect results

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 3.0.0
    • None
    • None

    Description

      Running the following on MiniTez

      set hive.mapred.mode=nonstrict;
      SET hive.vectorized.execution.enabled=true;
      SET hive.exec.orc.default.buffer.size=32768;
      SET hive.exec.orc.default.row.index.stride=1000;
      SET hive.optimize.index.filter=true;
      set hive.fetch.task.conversion=none;
      set hive.exec.dynamic.partition.mode=nonstrict;
      
      DROP TABLE orc_a;
      DROP TABLE orc_b;
      
      CREATE TABLE orc_a (id bigint, cdouble double) partitioned by (y int, q smallint)
        CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS stored as orc;
      CREATE TABLE orc_b (id bigint, cfloat float)
        CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS stored as orc;
      
      insert into table orc_a partition (y=2000, q)
      select cbigint, cdouble, csmallint % 10 from alltypesorc
        where cbigint is not null and csmallint > 0 order by cbigint asc;
      insert into table orc_a partition (y=2001, q)
      select cbigint, cdouble, csmallint % 10 from alltypesorc
        where cbigint is not null and csmallint > 0 order by cbigint asc;
      
      insert into table orc_b 
      select cbigint, cfloat from alltypesorc
        where cbigint is not null and csmallint > 0 order by cbigint asc limit 200;
      
      set hive.cbo.enable=false;
      
      select y,q,count(*) from orc_a a join orc_b b on a.id=b.id group by y,q;
      
      set hive.enforce.sortmergebucketmapjoin=false;
      set hive.optimize.bucketmapjoin=true;
      set hive.optimize.bucketmapjoin.sortedmerge=true;
      set hive.auto.convert.sortmerge.join=true;
      set hive.auto.convert.join=true;
      set hive.auto.convert.join.noconditionaltask.size=10;
      
      explain
      select y,q,count(*) from orc_a a join orc_b b on a.id=b.id group by y,q;
      select y,q,count(*) from orc_a a join orc_b b on a.id=b.id group by y,q;
      
      DROP TABLE orc_a;
      DROP TABLE orc_b;
      

      Produces different results for the two selects. The SMB one looks incorrect. cc djaiswal hagleitn

      Attachments

        1. HIVE-16965.8.patch
          7 kB
          Deepak Jaiswal
        2. HIVE-16965.7.patch
          7 kB
          Deepak Jaiswal
        3. HIVE-16965.6.patch
          7 kB
          Deepak Jaiswal
        4. HIVE-16965.5.patch
          6 kB
          Deepak Jaiswal
        5. HIVE-16965.4.patch
          6 kB
          Deepak Jaiswal
        6. HIVE-16965.3.patch
          36 kB
          Deepak Jaiswal
        7. HIVE-16965.2.patch
          50 kB
          Deepak Jaiswal
        8. HIVE-16965.1.patch
          4 kB
          Deepak Jaiswal

        Issue Links

          Activity

            People

              djaiswal Deepak Jaiswal
              sershe Sergey Shelukhin
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: