Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4695

Using 'replicated' left join results in different result from regular left join.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.15.0
    • None
    • None
    • None

    Description

      There seems to be a difference in results between regular LEFT JOIN and replicated LEFT JOIN. This may be a case only with very small data sets, as we're using piece of code shown below in production with correct results.
      EDIT:
      This issue only occurs when running PIG on Tez. (We're using Tez 7.0).

      Example:
      I have two data sets:

      first_period_users:

      (108,11,all_users,all_users)
      (108,13,all_users,all_users)
      (108,17,all_users,all_users)
      (138,11,all_users,all_users)
      

      second_period_users:

      (108,11,all_users,all_users)
      (108,13,all_users,all_users)
      

      When I use regular LEFT JOIN on these two I get the correct output:

      joined_periods_users = JOIN 
      $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
      $second_period_users BY (user_id, gg_id, dimension_name, dimension_value);
      

      output:

      (108,11,all_users,all_users,108,11,all_users,all_users)
      (138,11,all_users,all_users,,,,)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,17,all_users,all_users,,,,)
      

      BUT, if I add USING 'replicated', the result is completely different:

      $joined_periods_users = JOIN 
      $first_period_users BY (user_id, gg_id, dimension_name, dimension_value) LEFT,
      $second_period_users BY (user_id, gg_id, dimension_name, dimension_value) 
      USING 'replicated';
      

      output:

      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,11,all_users,all_users,108,11,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,13,all_users,all_users,108,13,all_users,all_users)
      (108,17,all_users,all_users,,,,)
      (138,11,all_users,all_users,,,,)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            Zibi Zbigniew Rzepka
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: