[PIG-1309] Sort Merge Cogroup - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.7.0, 0.8.0
Component/s: impl
Labels: None

Release Note:

With this patch, it is now possible to perform a map-side cogroup if the data is sorted and the loaders implement certain interfaces. The algorithm is based on sort-merge join, with additional restrictions.

The following preconditions must be met to use this feature:
1) No other operations can appear between the load and cogroup statements.
2) Data must be sorted on the join keys in ASC order for all tables.
3) Nulls are considered smaller than everything else, so if the data contains null keys, they must occur before all other keys.
4) The left-most loader must implement the {CollectableLoader} interface as well as {OrderedLoadFunc}.
5) All other loaders must implement IndexableLoadFunc.
6) Type information must be provided in the schema for all loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.

Similar conditions also apply to map-side outer joins (using merge) (PIG-1353).

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
C = COGROUP A by id, B by id using 'merge';
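
The merge step behind such a cogroup can be sketched in plain Python. This is only an illustration of the general sort-merge idea under the preconditions above (inputs sorted ascending on the key, nulls first), not the code in the patch. The function name merge_cogroup, the helper _order, and the default choice of the first field as the key are all assumptions made for this example. The release note does not say how null keys from different inputs are grouped; the sketch simply treats them as equal, which may differ from what Pig does.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Row = Tuple[Any, ...]
_DONE = object()  # sentinel marking an exhausted input (hypothetical helper)


def _order(k: Any) -> tuple:
    # Nulls (None) sort before every non-null key, mirroring precondition 3.
    return (0,) if k is None else (1, k)


def merge_cogroup(
    *relations: Iterable[Row],
    key: Callable[[Row], Any] = lambda row: row[0],
) -> Iterator[Tuple[Any, List[List[Row]]]]:
    """Yield (key, [bag_0, bag_1, ...]) in a single pass over each input.

    Assumes every input is already sorted ascending on `key`, nulls first.
    Simplification: None keys from different inputs land in the same group.
    """
    iters = [iter(r) for r in relations]
    heads = [next(it, _DONE) for it in iters]  # current unread row per input
    while any(h is not _DONE for h in heads):
        # The smallest key among the current heads is the next group to emit.
        current = min((key(h) for h in heads if h is not _DONE), key=_order)
        bags: List[List[Row]] = []
        for i, it in enumerate(iters):
            bag: List[Row] = []
            # Drain every row of this input that carries the current key.
            while heads[i] is not _DONE and key(heads[i]) == current:
                bag.append(heads[i])
                heads[i] = next(it, _DONE)
            bags.append(bag)
        yield current, bags
```

Calling list(merge_cogroup(a, b)) on two key-sorted lists of tuples yields one entry per distinct key, with an empty bag for any input that lacks that key. Because each input is read once and in order, no shuffle or reduce-side grouping is needed, which is the reason the feature requires sorted data.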


Description

In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It is already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) entirely on the map side in Pig. This JIRA adds a map-side implementation of Cogroup to Pig. Details to follow.

Attachments


mapsideCogrp.patch (55 kB), 19/Mar/10 18:48, Ashutosh Chauhan
pig-1309_1.patch (83 kB), 30/Mar/10 00:54, Ashutosh Chauhan
pig-1309_2.patch (93 kB), 01/Apr/10 21:45, Ashutosh Chauhan
PIG_1309_7.patch (95 kB), 02/Jul/10 19:23, Ashutosh Chauhan

People

Assignee: Ashutosh Chauhan

Reporter: Ashutosh Chauhan

Votes: 0

Watchers: 1

Dates

Created: 19/Mar/10 18:46

Updated: 17/Dec/10 22:43

Resolved: 09/Jul/10 17:49