[PIG-4344] Add a testcase with CustomPartitioner that tests ordering within a reducer - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Some of our users use a CustomPartitioner with join or group by because they know their data and know which keys to partition on. Since MapReduce delivers data sorted by key within each reducer, they also rely on that to have the data ordered.

For example:
partition = group mydata by (hour, sortkey1, sortkey2, sortkey3) using MyCustomPartitioner PARALLEL 24;

The custom partitioner sends hours 0-23 to partitions 0-23. Because each reducer receives its keys in sorted order, the grouped output is also sorted by (hour, sortkey1, sortkey2, sortkey3) without a separate order by.
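The behaviour such a testcase would need to verify can be sketched in plain Java. This is only a simulation under stated assumptions: the class HourPartitionSketch, its Key type, and the methods partitionFor, shuffle and isOrdered are hypothetical names, only two of the four key fields are modelled, and nothing here uses the Hadoop or Pig API (in Pig, the class named in the USING clause would extend Hadoop's Partitioner instead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone simulation; not the Pig or Hadoop API.
class HourPartitionSketch {

    // Simplified group key: (hour, sortkey1). The remaining sort keys are omitted.
    static final class Key {
        final int hour;
        final String sortKey1;

        Key(int hour, String sortKey1) {
            this.hour = hour;
            this.sortKey1 = sortKey1;
        }
    }

    // Assumed key ordering: by hour first, then by sortKey1.
    static final Comparator<Key> ORDER =
            Comparator.<Key>comparingInt(k -> k.hour).thenComparing(k -> k.sortKey1);

    // Mirrors what a partitioner's getPartition method would return:
    // hour h goes to partition h when numPartitions is 24.
    static int partitionFor(Key key, int numPartitions) {
        return Math.floorMod(key.hour, numPartitions);
    }

    // Simulates the shuffle: route each key to its partition, then sort
    // the keys inside every partition, as the reducer input would be.
    static Map<Integer, List<Key>> shuffle(List<Key> keys, int numPartitions) {
        Map<Integer, List<Key>> partitions = new TreeMap<>();
        for (Key k : keys) {
            partitions.computeIfAbsent(partitionFor(k, numPartitions),
                    p -> new ArrayList<>()).add(k);
        }
        for (List<Key> bucket : partitions.values()) {
            bucket.sort(ORDER);
        }
        return partitions;
    }

    // The property under test: reading partitions in partition order
    // yields keys that never decrease.
    static boolean isOrdered(Map<Integer, List<Key>> partitions) {
        Key previous = null;
        for (List<Key> bucket : partitions.values()) {
            for (Key k : bucket) {
                if (previous != null && ORDER.compare(previous, k) > 0) {
                    return false;
                }
                previous = k;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key(23, "b"), new Key(0, "z"), new Key(23, "a"), new Key(0, "a"));
        Map<Integer, List<Key>> partitions = shuffle(keys, 24);
        System.out.println("partitions used: " + partitions.keySet());
        System.out.println("ordered: " + isOrdered(partitions));
    }
}
```

A real testcase would presumably run the Pig script with the custom partitioner and apply a check like isOrdered to each reducer's output file rather than to an in-memory map.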

With HCatStorer, this pattern will become more common, i.e.,
partition = group mydata by (hour) using MyCustomPartitioner PARALLEL 24;
store partition into 'mydb.mytable' using HCatStorer();
instead of
store mydata into 'mydb.mytable' using HCatStorer();

where hour is the partition column. The extra group by ensures that 1 file is created per partition, instead of 24 files per partition that would have to be concatenated later to save namespace.

People

Assignee: Unassigned

Reporter: Rohini Palaniswamy

Votes: 0

Watchers: 1

Dates

Created: 26/Nov/14 17:25

Updated: 26/Nov/14 17:25