  1. Pig
  2. PIG-1353

Map-side outer joins


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Labels: None
    • Release Note:
      With this patch, it is now possible to perform [left|right|full] outer joins on two tables, as well as inner joins on more than two tables, on the map side in Pig, provided the data is sorted and the loaders implement the required interfaces. The primary algorithm is based on sort-merge join.

      The following preconditions must be met in order to use this feature:
      1) No other operations can be done between the load and join statements.
      2) Data must be sorted on the join keys in ASC order.
      3) Nulls are considered smaller than everything else. So, if the data contains null keys, they must occur before all other keys.
      4) The left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
      5) All other loaders must implement {IndexableLoadFunc}.
      6) Type information must be provided in the schema for all the loaders.

      Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.
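
Preconditions 2 and 3 together define one ordering: ascending, with nulls ahead of every other key. As a minimal sketch of that ordering (plain Python, not Pig code; the helper names `sort_key` and `is_sorted_nulls_first` are hypothetical and exist only for illustration), a key sequence could be checked like this:

```python
from typing import Iterable, Optional


def sort_key(key: Optional[int]):
    # Map None (null) to a tuple that compares below every non-null key,
    # mirroring "nulls are considered smaller than everything".
    return (0, 0) if key is None else (1, key)


def is_sorted_nulls_first(keys: Iterable[Optional[int]]) -> bool:
    """Return True if keys are in ascending order with all nulls at the front."""
    previous = None
    first = True
    for key in keys:
        current = sort_key(key)
        if not first and current < previous:
            return False  # a key went backwards, or a null followed a non-null
        previous, first = current, False
    return True
```

Duplicate keys are allowed; only a key that sorts below its predecessor violates the ordering.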

      Similar conditions apply to map-side cogroups (PIG-1309) as well.

      Example:
      A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
      B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
      C = join A by id left, B by id using 'merge';
      .....
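
To make the idea behind the release note concrete, here is a small in-memory sketch of a sort-merge join over two inputs that are already sorted ascending with nulls first. It illustrates the general technique only and is not Pig's implementation, which works through the loader interfaces listed above. The row shape `(key, payload)`, the function names, and the `how` parameter are assumptions made for this example, as is the choice that null keys never match each other.

```python
from itertools import groupby
from typing import Iterable, List, Optional, Tuple

Row = Tuple[Optional[int], str]  # (join key, payload); hypothetical row shape


def _order(key):
    # None (null) sorts before every non-null key.
    return (0, 0) if key is None else (1, key)


def _groups(rows: Iterable[Row]):
    """Yield (key, [payloads]) for each run of equal keys in sorted input."""
    for key, run in groupby(rows, key=lambda r: r[0]):
        yield key, [payload for _, payload in run]


def merge_join(left: Iterable[Row], right: Iterable[Row], how: str = "left") -> List[tuple]:
    """Join two key-sorted inputs in one pass; how is inner, left, right or full."""
    keep_left = how in ("left", "full")
    keep_right = how in ("right", "full")
    out = []
    lgroups, rgroups = _groups(left), _groups(right)
    lg, rg = next(lgroups, None), next(rgroups, None)
    while lg is not None or rg is not None:
        if lg is None:
            side = "right"
        elif rg is None:
            side = "left"
        elif lg[0] is None or _order(lg[0]) < _order(rg[0]):
            side = "left"   # left is behind, or has a null key (assumed never to match)
        elif _order(rg[0]) < _order(lg[0]):
            side = "right"  # right is behind
        else:
            side = "both"   # equal non-null keys
        if side == "left":
            if keep_left:
                out.extend((lg[0], p, None) for p in lg[1])
            lg = next(lgroups, None)
        elif side == "right":
            if keep_right:
                out.extend((rg[0], None, p) for p in rg[1])
            rg = next(rgroups, None)
        else:
            # Emit the cross product of the two matching key groups.
            out.extend((lg[0], lp, rp) for lp in lg[1] for rp in rg[1])
            lg, rg = next(lgroups, None), next(rgroups, None)
    return out
```

Because both inputs are consumed strictly in key order, neither side has to be held in memory beyond the current key group, which is what makes a map-side join feasible when the data is sorted.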

    Description

      Pig already has a couple of map-side join implementations: Merge Join and Fragment-Replicate (FR) Join. Both are fairly restrictive. Merge Join can join only two tables, and it supports only inner joins. FR Join can join multiple relations, but it supports only inner and left outer joins, and it also restricts the sizes of the side relations. It would be nice to be able to do map-side joins on multiple tables, and to do inner, left outer, right outer, and full outer joins.

      A lot of the groundwork for this has already been done in PIG-1309. The remaining work will be tracked in this JIRA.

      Attachments

        1. pig-1353.patch
          10 kB
          Ashutosh Chauhan
        2. ASF.LICENSE.NOT.GRANTED--pig-1353.patch
          60 kB
          Ashutosh Chauhan


          People

            Assignee: Ashutosh Chauhan (ashutoshc)
            Reporter: Ashutosh Chauhan (ashutoshc)