[PIG-2829] Use partial aggregation more aggresively - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.10.0
Fix Version/s: None
Component/s: None
Labels:
None

Description

Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature in Pig 0.10 that will perform aggregation within map function. The main advantage against combiner is it avoids de/serializing and sorting the data, and it can auto disable itself if the data reduction rate is low. Currently it's disabled by default.

To leverage the power of PartialAgg more aggressively, several things need to be revisited:

1. The threshold of auto-disabling. Currently each mapper looks at first 1k (hard-coded) records to see if there's enough data size reduction (defaults to 10x, configurable). The check would happen earlier if the hash table gets full before processing the 1k records (hash table size is controlled by pig.cachedbag.memusage). We might want to relax these thresholds.

2. Dependency on the combiner. Currently the PartialAgg won't work without a combiner following it, so we need to provide separate options to enable each independently.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

2829.2.patch
27/Jul/12 20:44
20 kB
Jie Li
2829.1.patch
26/Jul/12 23:45
11 kB
Jie Li
2829.separate.options.patch
26/Jul/12 17:11
4 kB
Jie Li
tpch-10G.png
19/Jul/12 23:24
100 kB
Jie Li
pigmix-10G.png
19/Jul/12 23:24
114 kB
Jie Li

Issue Links

is related to

PIG-2228 support partial aggregation in map task

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Jie Li

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 19/Jul/12 22:34

Updated:: 06/Sep/12 05:08