Affects Version/s: 0.12.0
Fix Version/s: 0.12.0
Suppose that we have a query shown below
When the number of rows in t2 exceeds hive.join.emit.interval, JoinOperator emits rows from t1, which results in redundant output.
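The effect can be illustrated with a toy model. This is only a sketch, not Hive's JoinOperator code: it assumes the query is a left semi join of t1 against t2 (the patch name suggests a semi join), and it assumes that a flush happens every time `emit_interval` rows from the right side have been buffered for a key. The function names and the sample rows are invented for illustration and are not the tables from this report.

```python
from collections import defaultdict


def left_semi_join(left, right):
    """Reference semantics: a left row is output at most once if its key exists in right."""
    right_keys = {key for key, _ in right}
    return [row for row in left if row[0] in right_keys]


def semi_join_with_early_emit(left, right, emit_interval):
    """Toy model of the bug (hypothetical, not Hive code).

    For each key, right-side rows are counted; every time the count reaches
    emit_interval the matching left rows are emitted and the count is reset.
    """
    left_by_key = defaultdict(list)
    for row in left:
        left_by_key[row[0]].append(row)
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[0]].append(row)

    output = []
    for key, left_rows in left_by_key.items():
        buffered = 0
        for _ in right_by_key.get(key, []):
            buffered += 1
            if buffered >= emit_interval:
                output.extend(left_rows)  # early emit: left rows go out again
                buffered = 0
        if buffered > 0:
            output.extend(left_rows)  # final emit at the end of the key group
    return output


# Invented sample data: key 1 matches three right rows, key 2 matches none.
sample_t1 = [(1, "a"), (2, "b")]
sample_t2 = [(1, "x"), (1, "y"), (1, "z")]

expected = left_semi_join(sample_t1, sample_t2)
with_small_interval = semi_join_with_early_emit(sample_t1, sample_t2, 1)
with_large_interval = semi_join_with_early_emit(sample_t1, sample_t2, 1000)
```

In this model the large interval reproduces the reference result, while an interval of 1 emits the matching t1 row once per matching t2 row.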
Let's say t1 is
and t2 is
When hive.join.emit.interval=1, the output of the above query will be
The correct result should be
This problem cannot be found by the unit tests: because a GBY operator is inserted before the JoinOperator and there is only one mapper, the output of the map phase contains only distinct keys.
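A rough sketch of why a single mapper hides the problem is given below. It is a simplified model and not Hive code: it assumes the map-side GBY removes duplicate keys within each mapper only, and that the join re-emits a left row every time `emit_interval` right-side rows have been counted for its key. The multi-mapper case is an extrapolation from the explanation above, and all names are hypothetical.

```python
def map_side_distinct(keys_per_mapper):
    """Model of a map-side GBY: each mapper outputs each of its keys once."""
    output = []
    for keys in keys_per_mapper:
        seen = set()
        for key in keys:
            if key not in seen:
                seen.add(key)
                output.append(key)
    return output


def emits_for_key(right_keys, key, emit_interval):
    """Count how often a left row with `key` is emitted in the toy model."""
    emits = 0
    buffered = 0
    for right_key in right_keys:
        if right_key != key:
            continue
        buffered += 1
        if buffered >= emit_interval:
            emits += 1  # early emit triggered by the interval
            buffered = 0
    if buffered > 0:
        emits += 1  # final emit at the end of the key group
    return emits


# One mapper sees all three duplicate keys and collapses them to one.
one_mapper = map_side_distinct([[1, 1, 1]])
# Three mappers each see the key once, so the join still receives it three times.
three_mappers = map_side_distinct([[1], [1], [1]])

emits_one_mapper = emits_for_key(one_mapper, 1, 1)
emits_three_mappers = emits_for_key(three_mappers, 1, 1)
```

With one mapper the join receives a single row per key, so the interval never causes a second emit; under this model the duplicates only become visible once the same key arrives from more than one mapper.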
Please apply the patch 'wrong_semi_join.txt' attached below and use
to reproduce the problem. The wrong result can be found in