Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.15.0
Fix Version/s: None
Component/s: None
Labels: None
Description
There seems to be a difference in results between a regular LEFT JOIN and a replicated LEFT JOIN. This may happen only with very small data sets, since we use the code shown below in production and get correct results.
EDIT:
This issue only occurs when running Pig on Tez (we're using Tez 7.0).
Example:
I have two data sets:
first_period_users:
(108,11,all_users,all_users)
(108,13,all_users,all_users)
(108,17,all_users,all_users)
(138,11,all_users,all_users)
second_period_users:
(108,11,all_users,all_users)
(108,13,all_users,all_users)
When I use regular LEFT JOIN on these two I get the correct output:
joined_periods_users = JOIN $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT, $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
output:
(108,11,all_users,all_users,108,11,all_users,all_users)
(138,11,all_users,all_users,,,,)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
But if I add USING 'replicated', the result is completely different: each matched row is emitted seven times instead of once.
$joined_periods_users = JOIN
$first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
$second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
USING 'replicated';
output:
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,11,all_users,all_users,108,11,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,13,all_users,all_users,108,13,all_users,all_users)
(108,17,all_users,all_users,,,,)
(138,11,all_users,all_users,,,,)
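A standalone script for comparing the two join strategies side by side might look like the sketch below. It makes several assumptions that are not in the report: the input file names (first_period_users.txt, second_period_users.txt) are hypothetical, the files are assumed to be comma-delimited with one tuple per line, and the column types are guessed from the sample data. The original snippets use $-prefixed parameters, which suggests they run inside a macro or with parameter substitution; plain aliases are used here instead, so this reduced form has not been confirmed to trigger the same behavior.

```pig
-- Hypothetical input files, one comma-separated tuple per line,
-- e.g. "108,11,all_users,all_users". Column types are assumed.
first_period_users = LOAD 'first_period_users.txt' USING PigStorage(',')
    AS (user_id:int, gg_id:int, dimension_name:chararray, dimension_value:chararray);

second_period_users = LOAD 'second_period_users.txt' USING PigStorage(',')
    AS (user_id:int, gg_id:int, dimension_name:chararray, dimension_value:chararray);

-- Regular left outer join: the report expects 4 rows for the sample data.
regular_join = JOIN
    first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
    second_period_users BY (user_id, gg_id, dimension_name, dimension_value);

-- Replicated (fragment-replicate) left outer join: should yield the same rows.
replicated_join = JOIN
    first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
    second_period_users BY (user_id, gg_id, dimension_name, dimension_value)
    USING 'replicated';

DUMP regular_join;
DUMP replicated_join;
```

Running the same script once with `pig -x mapreduce` and once with `pig -x tez` (or the corresponding local modes) would show whether the duplicated rows appear only under Tez, as described in the EDIT above.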