[PIG-1038] Optimize nested distinct/sort to use secondary key - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.4.0
Fix Version/s: 0.6.0
Component/s: impl
Labels: None
Hadoop Flags: Reviewed

Description

If a nested foreach plan contains a sort or a distinct, Pig can use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';

We can specify a secondary sort on A.$1 and drop "order A by $1", because the tuples of each group then reach the reducer already ordered by $1.
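
The idea behind Eg1 can be sketched outside Pig. The snippet below is only an illustrative simulation in Python, not Pig's or Hadoop's actual implementation: the function name group_with_secondary_sort and the sample records are hypothetical. It assumes the shuffle sorts on the composite key ($0, $1) while grouping only on $0, so each group arrives already ordered and no per-group sort is needed.

```python
from itertools import groupby
from operator import itemgetter


def group_with_secondary_sort(records):
    """Simulate a shuffle that sorts on (field 0, field 1) but groups on field 0.

    Hypothetical helper for illustration; it is not part of Pig or Hadoop.
    """
    # One sort on the composite key stands in for the shuffle's sort phase.
    shuffled = sorted(records, key=itemgetter(0, 1))
    # Grouping uses only the primary key, so each group's rows are
    # already ordered by field 1 and need no further sorting.
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        yield key, list(rows)


# Made-up sample data standing in for 'mydata'.
records = [("b", 2, "x"), ("a", 3, "y"), ("a", 1, "z"), ("b", 1, "w")]
result = list(group_with_secondary_sort(records))
for key, bag in result:
    print(key, bag)
```

Each bag comes out ordered by its second field without a separate sort inside the group, which is the work that "order A by $1" would otherwise do with a SortedDataBag.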

Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';

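We can specify a secondary sort key on A.$1 and simplify "D = A.$1; E = distinct D" to a special version of distinct that does not do the sorting. Since the values already arrive in sorted order, duplicates are adjacent and can be dropped in a single pass.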
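
A sort-free distinct of the kind Eg2 describes can be sketched as follows. This is a hedged Python simulation, not the code in the attached patches: distinct_sorted, nested_distinct and the sample records are hypothetical names and data. It assumes the shuffle has already ordered each group's values by the secondary key, so removing duplicates only requires comparing each value with the previous one.

```python
from itertools import groupby
from operator import itemgetter

_UNSET = object()  # sentinel meaning "no previous value seen yet"


def distinct_sorted(values):
    """Yield each value once, assuming equal values are adjacent (sorted input)."""
    previous = _UNSET
    for value in values:
        if previous is _UNSET or value != previous:
            yield value
            previous = value


def nested_distinct(records):
    """Simulate 'D = A.$1; E = distinct D' on top of a secondary sort.

    Hypothetical helper for illustration; it is not part of Pig or Hadoop.
    """
    # The sort on (field 0, field 1) stands in for the shuffle with a secondary key.
    shuffled = sorted(records, key=itemgetter(0, 1))
    for key, rows in groupby(shuffled, key=itemgetter(0)):
        # Project field 1 and de-duplicate in one pass, with no per-group sort.
        yield key, list(distinct_sorted(row[1] for row in rows))


# Made-up sample data standing in for 'mydata'.
records = [("a", 2), ("b", 1), ("a", 1), ("a", 2), ("b", 1)]
result = list(nested_distinct(records))
for key, bag in result:
    print(key, bag)
```

The single-pass comparison is only correct because the input is sorted; on unsorted input it would miss non-adjacent duplicates, which is why the optimization depends on the secondary sort key.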

Attachments

- PIG-1038-1.patch, 08/Nov/09 07:07, 110 kB, Daniel Dai
- PIG-1038-2.patch, 09/Nov/09 04:33, 110 kB, Daniel Dai
- PIG-1038-3.patch, 11/Nov/09 08:27, 128 kB, Daniel Dai
- PIG-1038-4.patch, 12/Nov/09 00:08, 109 kB, Daniel Dai
- PIG-1038-5.patch, 12/Nov/09 03:11, 109 kB, Daniel Dai

Issue Links

is related to: PIG-1295 Binary comparator for secondary sort (Closed)

People

Assignee: Daniel Dai
Reporter: Olga Natkovich
Votes: 0
Watchers: 1

Dates

Created: 21/Oct/09 21:56
Updated: 24/Mar/10 22:15
Resolved: 12/Nov/09 07:49