Description
I have a Pig statement similar to:
summary = foreach (group data ALL) generate
    COUNT(data.col1), SUM(data.col2)
    , Moments(data.col3)
    , Moments(data.col4);
There are a couple of hundred columns.
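Moments here is a custom UDF, not a Pig builtin. For in-map partial aggregation (POPartialAgg) to combine records map-side, the UDFs in the foreach need to implement Algebraic; the following is only a hypothetical sketch of the shape such a UDF takes (accumulating count, sum, and sum of squares), not the real implementation:

import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class Moments extends EvalFunc<Tuple> implements Algebraic {
    private static final TupleFactory TF = TupleFactory.getInstance();

    // Fold a bag of inputs into a (count, sum, sum-of-squares) triple.
    // raw == true means the bag holds raw column values; otherwise it
    // holds partial triples produced by an earlier fold.
    private static Tuple fold(Tuple input, boolean raw) throws IOException {
        long n = 0; double sum = 0, sumSq = 0;
        for (Tuple t : (DataBag) input.get(0)) {
            if (raw) {
                Double v = (Double) t.get(0);   // assumes a double column
                if (v == null) continue;
                n++; sum += v; sumSq += v * v;
            } else {
                n += (Long) t.get(0);
                sum += (Double) t.get(1);
                sumSq += (Double) t.get(2);
            }
        }
        Tuple out = TF.newTuple(3);
        out.set(0, n); out.set(1, sum); out.set(2, sumSq);
        return out;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException { return fold(input, true); }

    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    public static class Initial extends EvalFunc<Tuple> {
        @Override public Tuple exec(Tuple in) throws IOException { return fold(in, true); }
    }
    public static class Intermed extends EvalFunc<Tuple> {
        @Override public Tuple exec(Tuple in) throws IOException { return fold(in, false); }
    }
    public static class Final extends EvalFunc<Tuple> {
        // Final returns the accumulators; mean/variance derive from them.
        @Override public Tuple exec(Tuple in) throws IOException { return fold(in, false); }
    }
}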
I set the following:
SET pig.exec.mapPartAgg true;
SET pig.exec.mapPartAgg.minReduction 3;
SET pig.cachedbag.memusage 0.05;
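For reference: pig.exec.mapPartAgg turns on in-map partial aggregation (POPartialAgg), pig.exec.mapPartAgg.minReduction tells Pig to disable it when the observed reduction falls below that factor, and pig.cachedbag.memusage caps the fraction of the heap that cached bags may use. Roughly, the fraction becomes a byte budget as in this simplified sketch (not Pig's exact internal code):

public class CachedBagBudget {
    // Translate pig.cachedbag.memusage into a byte budget.
    // Simplified sketch; Pig's actual accounting lives in its internal
    // bag classes and is more involved.
    static long budgetBytes(double memUsageFraction) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // the -Xmx ceiling
        return (long) (maxHeap * memUsageFraction);
    }

    public static void main(String[] args) {
        // With memusage = 0.05 on a 1 GiB heap, cached bags get ~51 MiB.
        System.out.println(budgetBytes(0.05));
    }
}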
I found that when I ran this on a JVM with insufficient memory, the process eventually timed out in an endless garbage-collection loop.
The problem was unaffected by the pig.cachedbag.memusage setting.
I solved the problem by making changes to:
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.java
Rather than reading in 10,000 records to establish an estimate of the reduction, I make the estimate after reading in enough tuples to fill pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
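A sketch of that trigger logic (simplified, with hypothetical names; not the literal patch):

public class EstimateTrigger {
    // pig.cachedbag.memusage fraction of the max heap, as described above.
    private final long budgetBytes =
            (long) (Runtime.getRuntime().maxMemory() * 0.05);
    private long bufferedBytes = 0;
    private boolean estimated = false;

    // Called once per buffered tuple; in Pig the size would come from
    // the tuple's memory-size accounting (Tuple.getMemorySize()).
    void accumulate(long tupleSizeBytes) {
        bufferedBytes += tupleSizeBytes;
        if (!estimated && bufferedBytes >= budgetBytes) {
            estimateReduction(); // decide if map-side aggregation pays off
            estimated = true;
        }
    }

    private void estimateReduction() {
        // compare raw record count to aggregated record count and
        // disable partial aggregation if below minReduction
    }
}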
I also made a change to guarantee room for at least one record in second-tier storage. In the current implementation, if the reduction is very high (e.g., 1000:1), the space allotted to second-tier storage works out to zero.
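An illustration of the clamp (hypothetical names; the real sizing in POPartialAgg is more involved):

public class SecondTierSizing {
    // Size second-tier storage from the estimated reduction, but never
    // let it round down to zero slots.
    static int secondTierCapacity(int firstTierCapacity, int estimatedReduction) {
        int slots = firstTierCapacity / estimatedReduction; // e.g., 100 / 1000 -> 0
        return Math.max(1, slots); // guarantee room for at least one record
    }
}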
With these changes, I can summarize large data sets with small JVMs. I also find that setting pig.cachedbag.memusage to a small number such as 0.05 gives much better garbage-collection behavior without reducing throughput. I suppose GC tuning could also address the excessive garbage collection.
The performance is sweet.
Issue Links
- breaks PIG-4564 "Pig can deadlock in POPartialAgg if there is a bag" (Closed)