HIVE-4781

LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceeds hive.join.emit.interval

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

    Description

      Consider the following query:

      SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
      

      When the number of t2 rows that share a single join key exceeds hive.join.emit.interval, JoinOperator emits the matching rows from t1 more than once, which produces redundant output.

      Let's say t1 is

      1
      

      and t2 is

      1
      1
      1
      1
      

      When hive.join.emit.interval=1, the output of the above query is

      1
      1
      1
      1
      

      The correct result should be

      1
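
      The example above can be mimicked with a toy model. This is only a sketch and not Hive's JoinOperator code: the function name, the `dedupe` flag, and the flush-every-N-rows rule are assumptions chosen so that the model reproduces the numbers in this report.

```python
def semi_join_one_key(left_rows, right_rows, emit_interval, dedupe):
    """Toy LEFT SEMI JOIN for rows that all share one join key.

    Assumption: right-side rows arrive one at a time, and the join
    flushes its result every `emit_interval` right rows (hypothetical
    stand-in for hive.join.emit.interval).
    """
    output = []
    pending = 0              # right rows seen since the last flush
    already_emitted = False  # have the left rows been emitted for this key?

    def flush():
        nonlocal already_emitted
        if dedupe and already_emitted:
            return           # semi join: each left row is emitted at most once
        output.extend(left_rows)
        already_emitted = True

    for _ in right_rows:
        pending += 1
        if pending >= emit_interval:
            flush()
            pending = 0
    if pending:
        flush()              # flush whatever is left at the end of the key group
    return output


t1 = [1]
t2 = [1, 1, 1, 1]
print(semi_join_one_key(t1, t2, emit_interval=1, dedupe=False))  # [1, 1, 1, 1]
print(semi_join_one_key(t1, t2, emit_interval=1, dedupe=True))   # [1]
```

      With `dedupe=False` every flush re-emits the t1 row, which matches the wrong output shown above. The `dedupe=True` branch only illustrates the expected semi-join semantics; it is not a description of the attached fix.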
      

      The existing unit tests do not catch this problem. Because a GBY (group-by) operator is inserted before JoinOperator and the tests run with only one mapper, the output of the map phase contains only distinct keys.
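
      A rough way to see why the number of mappers matters is sketched below. The function names and the way t2 is split are hypothetical, and the map-side GBY is reduced to a per-mapper distinct step, which is an assumption about its effect rather than its implementation.

```python
def map_side_distinct(split):
    """Stand-in for the map-side GBY: one row per distinct key in this mapper's split."""
    return sorted(set(split))


def rows_reaching_join(splits):
    """Collect what all mappers send to the join after their local distinct step."""
    shuffled = []
    for split in splits:
        shuffled.extend(map_side_distinct(split))
    return shuffled


# One mapper sees all of t2, so only one row per key reaches the join.
print(rows_reaching_join([[1, 1, 1, 1]]))        # [1]

# Four mappers (hypothetical split) each see one row; the key reaches the join four times.
print(rows_reaching_join([[1], [1], [1], [1]]))  # [1, 1, 1, 1]
```

      Under this model, a single-mapper test never hands the join more than one t2 row per key, so the emit interval is never exceeded; a multi-mapper run such as TestMinimrCliDriver can.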

      To reproduce the problem, apply the attached patch 'wrong_semi_join.txt' and run

      ant test -Dtestcase=TestMinimrCliDriver -Dqfile="left_semi_join.q" -Dtest.silent=false
      

      The wrong result can then be found in

      <hive_root_dir>/build/ql/test/logs/clientpositive
      

      Attachments

        1. wrong_semi_join.txt
          3 kB
          Yin Huai
        2. HIVE-4781.txt
          7 kB
          Yin Huai

            People

              Assignee: Yin Huai (yhuai)
              Reporter: Yin Huai (yhuai)