Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23438

Missing Rows When Left Outer Join In N-way HybridGraceHashJoin

    XMLWordPrintableJSON

Details

    • Bug
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 2.3.4
    • None
    • SQL, Tez

    Description

      Run Test in Patch File

      mvn test -Dtest=TestMiniTezCliDriver -Dqfile=hybridgrace_hashjoin_2.q

      Manual Reproduce

      STEP 1. Create test data(q_test_init_tez.sql)

      //create table src1
      CREATE TABLE src1 (key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv3.txt" INTO TABLE src1;
      
      //create table src2
      CREATE TABLE src2(key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv11.txt" OVERWRITE INTO TABLE src2;
      
      //create table srcpart
      CREATE TABLE srcpart (key STRING COMMENT 'default', value STRING COMMENT 'default')
      PARTITIONED BY (ds STRING, hr STRING)
      STORED AS TEXTFILE;
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
      OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="11");
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
      OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="12");
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
      OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="11");
      
      LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
      OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="12");

      STEP 2. Run query

      set hive.auto.convert.join=true; 
      set hive.auto.convert.join.noconditionaltask=true; 
      set hive.auto.convert.join.noconditionaltask.size=10000000; 
      set hive.cbo.enable=false;
      set hive.mapjoin.hybridgrace.hashtable=true;
      
      select *
      from
      (
      select key from src1 group by key
      ) x
      left join src2 z on x.key = z.key
      join
      (
      select key from srcpart y group by key
      ) y on y.key = x.key;
      

      EXPECTED RESULT**

       

      128	NULL	NULL	128
      146	146	1val_1461	146
      150	150	1val_1501	150
      238	NULL	NULL	238
      369	NULL	NULL	369
      406	406	1val_4061	406
      273	273	1val_2731	273
      98	NULL	NULL	98
      213	213	1val_2131	213
      255	NULL	NULL	255
      401	401	1val_4011	401
      278	NULL	NULL	278
      66	66	11val_6611	66
      224	NULL	NULL	224
      311	NULL	NULL	311
      

       

      ACTUAL RESULT

      128	NULL	NULL	128
      146	146	1val_1461	146
      150	150	1val_1501	150
      213	213	1val_2131	213
      238	NULL	NULL	238
      273	273	1val_2731	273
      369	NULL	NULL	369
      406	406	1val_4061	406
      98	NULL	NULL	98
      401	401	1val_4011	401
      66	66	11val_6611	66
      

       

      ROOT CAUSE

      src1 left join src2, src1 is big table and src2 is small table. Join result between big table row and the corresponding hashtable maybe NO_MATCH state, however, these NO_MATCH rows is needed because LEFT OUTER JOIN.

      In addition, these big table rows will not spilled into matchfile related to this hashtable on disk because only SPILL state can use `spillBigTableRow`.  Then, these big table rows will be spilled into matchfile in hashtables of table `srcpart`(second small table)

      Finally, when reProcessBigTable, big table rows in matchfile are only read from `firstSmallTable`, some datum are missing.

       

      WORKAROUND

       configure firstSmallTable in completeInitializationOp and only spill big table row into firstSmallTable when spill matchfile.

       

      Attachments

        1. HIVE-23438.branch-2.3.patch
          21 kB
          Yizhen Fan
        2. HIVE-23438.001.branch-2.3.patch
          21 kB
          Yizhen Fan

        Issue Links

          Activity

            People

              sparksfyz Yizhen Fan
              sparksfyz Yizhen Fan
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m