Pig / PIG-1249

Safe-guards against misconfigured Pig scripts without PARALLEL keyword

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      In previous versions of Pig, if the number of reducers was not specified (via PARALLEL or default_parallel), a value of 1 was used, which in many cases was not a good choice and caused severe performance problems.

      In Pig 0.8.0, a simple heuristic is used to come up with a better number based on the size of the input data. There are several parameters that the user can control:

      pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per reducer; the default value is 1000*1000*1000 (1 GB)
      pig.exec.reducers.max - defines the upper bound on the number of reducers; the default is 999

      The formula is very simple:

      #reducers = MIN(pig.exec.reducers.max, total input size (in bytes) / bytes per reducer)

      This is a very simplistic formula that we will need to improve over time. Note that the computed value takes all inputs within the script into account and applies the same value to all the jobs within the Pig script.

      Note that this is not a backward compatible change; set default_parallel to 1 to restore the previous behavior.
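
      A minimal sketch of this heuristic in Java (the method and parameter names here are illustrative, not the exact identifiers in the Pig source):

      // #reducers = MIN(pig.exec.reducers.max, total input size / bytes per reducer),
      // with a floor of 1 reducer when the input size cannot be determined.
      static int estimateNumberOfReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
          long estimate = (long) Math.ceil((double) totalInputBytes / bytesPerReducer);
          return (int) Math.max(1, Math.min(estimate, maxReducers));
      }

      // Example: with the defaults (1 GB per reducer, max 999), 3 GB of total input
      // yields MIN(999, 3) = 3 reducers.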

      Description

      It would be very useful for Pig to have safe-guards against naive scripts which process a lot of data without the use of the PARALLEL keyword.

      We've seen a fair number of instances where naive users process huge data-sets (>10TB) with a badly misconfigured number of reducers, e.g. 1 reducer.

      1. PIG-1249_5.patch
        10 kB
        Jeff Zhang
      2. PIG-1249-4.patch
        9 kB
        Alan Gates
      3. PIG_1249_3.patch
        9 kB
        Jeff Zhang
      4. PIG_1249_2.patch
        8 kB
        Jeff Zhang
      5. PIG-1249.patch
        7 kB
        Jeff Zhang


          Activity

          Jeff Zhang added a comment -

          +1. And I find that Hive can estimate the reducer number according to the input size; this is a really useful feature.

          Jeff Zhang added a comment -

          The current idea is borrowed from Hive: use the input file size to estimate the reducer number.
          Two parameters can be set for this purpose:
          pig.exec.reducers.bytes.per.reducer // the number of bytes of input for each reducer
          pig.exec.reducers.max // the maximum number of reducers

          This only works for HDFS; it won't work for other data sources such as HBase or Cassandra.
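
          A minimal sketch, using Hadoop's FileSystem API, of how the total input size could be computed (the name getTotalInputFileSize appears later in this thread; the body below is illustrative, not the actual patch):

          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          static long getTotalInputFileSize(Configuration conf, List<String> inputPaths) {
              long total = 0;
              for (String input : inputPaths) {
                  try {
                      Path p = new Path(input);
                      FileSystem fs = p.getFileSystem(conf);
                      if (fs.exists(p)) {
                          // getContentSummary sums file sizes recursively under a directory
                          total += fs.getContentSummary(p).getLength();
                      }
                  } catch (java.io.IOException e) {
                      // Non-file inputs (e.g. an HBase table) cannot be stat'ed; count them as 0.
                  }
              }
              return total;
          }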

          Thejas M Nair added a comment -

          If default_parallel has not been set, the patch sets a new default number of reducers based on input file sizes.
          If the 'input' specified in the load statement is not an HDFS file, it fails to find the file size and the default of 1 reducer will be used.

          The next steps in automatically determining number of reducers (which can be addressed in separate jiras) are -
          1. Determining different number of reducers for each MR job of a pig-query, based on the input size for the MR job.
          2. Extending this functionality to load functions that don't take hdfs files as input. We can look at using LoadMetaData.getStatistics().

          Comments on the patch -
          If default_parallel is specified, the number of reducers doesn't need to be determined.

          
          estimateNumberOfReducers(conf, mro);
          if (pigContext.defaultParallel > 0)
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);

          can be changed to

          if (pigContext.defaultParallel > 0)
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          else
              estimateNumberOfReducers(conf, mro);
          Everything else looks good.

          Hudson still seems to be having problems. I am currently running unit tests with this patch.

          Alan Gates added a comment -

          One thing we want to be sure of is that if users explicitly set parallel to 1, we don't override it. From reviewing the above code it isn't clear to me whether that's the case here or not.

          Dmitriy V. Ryaboy added a comment -

          This is a good spot to leverage pinning options that were added to operators a while back. The parser would pin the parallel option if it encounters the PARALLEL keyword, and the code in this patch wouldn't get invoked unless parallelism is not pinned.

          Jeff Zhang added a comment -

          To clarify the logic (a sketch follows):
          first, check PARALLEL in the query;
          if not set, then check defaultParallel in the PigContext;
          if not set, estimate the reducer number.
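
          A minimal sketch of that ordering (field and method names are illustrative, not the exact identifiers in the patch):

          // 1) PARALLEL in the query, 2) default_parallel, 3) estimate from input size.
          if (mro.requestedParallelism > 0) {
              conf.set("mapred.reduce.tasks", "" + mro.requestedParallelism);
          } else if (pigContext.defaultParallel > 0) {
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          } else {
              estimateNumberOfReducers(conf, mro);
          }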

          Alan Gates added a comment -

          Questions/Comments:

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.
          3. Does this estimate apply only to the first job or to all jobs?
          4. How does this work in the case of joins, where there are multiple inputs to a job?
          Jeff Zhang added a comment -

          Response to Alan's questions,

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.

          It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return 0 if we are not loading from a file or the file doesn't exist, and the final estimated reducer number will be 1.

          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.

          These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience.

          3. Does this estimate apply only to the first job or to all jobs?

          It will apply to all the jobs.

          4. How does this work in the case of joins, where there are multiple inputs to a job?

          It will estimate the reducer number according to the total size of all the input files.

          Alan Gates added a comment -

          1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.


          It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return 0 if we are not loading from a file or the file doesn't exist, and the final estimated reducer number will be 1.


          Could we add a test to test this? I think it would be good to ensure it works in this situation. Maybe you could take one of the tests that uses the HBase loader.

          2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.


          These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience.


          ok, good enough. We can adjust them later if we need to.

          3. Does this estimate apply only to the first job or to all jobs?


          It will apply to all the jobs.


          Eventually we should change this to do the estimation on the fly in the JobControlCompiler. Since most queries tend to aggregate data down after a number of steps I suspect that using the initial input to estimate the entire query will mean that the final results are parallelized too widely. But this is better than the current situation where they aren't parallelized at all.

          4. How does this work in the case of joins, where there are multiple inputs to a job?


          It will estimate the reducer number according to the total size of all the input files.


          cool

          So other than testing the non-file case I'm +1 on this patch.

          Jeff Zhang added a comment -

          Updated the patch, including a test case for non-DFS input, and added path checking when doing the estimation.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12445559/PIG_1249_3.patch
          against trunk revision 948526.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h1.grid.sp2.yahoo.net/6/console

          This message is automatically generated.

          Alan Gates added a comment -

          +1, new test looks good.

          Hudson is still having troubles. We should run the "ant test" and "ant test-patch" directives manually.

          Alan Gates added a comment -

          The latest patch doesn't apply because of a merge conflict. I'll attach a patch that addresses this.

          Alan Gates added a comment -

          Patch with merge conflict resolution.

          Jeff Zhang added a comment -

          Alan, thanks for your help.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12446173/PIG-1249-4.patch
          against trunk revision 951229.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/329/console

          This message is automatically generated.

          Ashutosh Chauhan added a comment -

          The Map-Reduce framework has a JIRA related to this issue: https://issues.apache.org/jira/browse/MAPREDUCE-1521. It has two implications for Pig:

          1) We need to reconsider whether we still want Pig to set the number of reducers on the user's behalf. We can choose not to "intelligently" choose the # of reducers and let the framework fail any job that doesn't "correctly" specify the # of reducers. Then Pig is out of this guessing game and users are forced by the framework to correctly specify the # of reducers.

          2) Now that the MR framework will fail the job based on configured limits, operators where Pig does compute and set the number of reducers (like skewed join etc.) should now be aware of those limits so that the # of reducers computed by them falls within those limits.

          Olga Natkovich added a comment -

          Ashutosh,

          First, the changes are not going to be in the framework until Hadoop 0.22, and I don't think we want to wait that long, as we are seeing quite a few problems on our cluster. Second, I think we want to take the direction with Pig of setting things up for users. Of course, we don't have stats right now to do so accurately, but I think this is a step in the right direction.

          Olga Natkovich added a comment -

          Jeff, sorry this patch did not get much attention in a while. Can I ask you to do the following:

          (1) Regenerate the patch for the latest trunk and make sure that the tests pass and we get no additional warnings.
          (2) Add a docs comment that describes in one place what the exact heuristics are, when they are applied, and how they can be influenced. I will ask our doc writer to incorporate this information into the Pig 0.8.0 documentation.
          (3) If it is not already done, can we log the value that will be used so that the user knows what is happening?

          Thanks!

          Jeff Zhang added a comment -

          Olga, I generated the patch for the latest trunk and added documentation in the method estimateNumberOfReducers in JobControlCompiler. If you need anything else, feel free to tell me.

          Olga Natkovich added a comment -

          Hi Jeff,

          Thanks for the quick response. I will review and commit the patch. I am going to add a log statement for the reducer value that has been computed. I will also copy your doc comment from the source to the JIRA to assist our doc writer.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12450579/PIG-1249_5.patch
          against trunk revision 979503.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/359/console

          This message is automatically generated.

          Olga Natkovich added a comment -

          Patch committed. Thanks Jeff!

          Olga Natkovich added a comment -

          Comments for the documentation:

          /**
           * Currently the estimation of the reducer number is only applied to HDFS; the estimation is based on the size of the input data stored on HDFS.
           * Two parameters can be configured for the estimation: one is pig.exec.reducers.max, which constrains the maximum number of reducer tasks (default is 999). The other
           * is pig.exec.reducers.bytes.per.reducer (default value is 1000*1000*1000), which is how much data each reducer should handle.
           * E.g. the following is your pig script:
           *   a = load '/data/a';
           *   b = load '/data/b';
           *   c = join a by $0, b by $0;
           *   store c into '/tmp';
           *
           * The size of /data/a is 1000*1000*1000, and the size of /data/b is 2*1000*1000*1000.
           * Then the estimated reducer number is (1000*1000*1000 + 2*1000*1000*1000) / (1000*1000*1000) = 3.
           */

          Anup added a comment -

          One thing we didn't take care of is the use of the Hadoop parameter "mapred.reduce.tasks".
          If I specify the Hadoop parameter -Dmapred.reduce.tasks=450 for all the MR jobs, it is overwritten by estimateNumberOfReducers(conf, mro), which in my case is 15.
          I am not specifying any default_parallel or PARALLEL statements.

          Ideally, the number of reducers should be 450.

          I think we should prioritize this parameter above the estimated-reducers calculation.
          The priority list should be (a sketch of this ordering follows the list):

          1. PARALLEL statement
          2. default_parallel statement
          3. mapred.reduce.tasks Hadoop parameter
          4. estimateNumberOfReducers()
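
          A minimal sketch of the proposed ordering (names are illustrative; detecting whether mapred.reduce.tasks was supplied by the user rather than by the framework default is the part the follow-up fix has to handle, see PIG-1810 below):

          // Proposed precedence: PARALLEL > default_parallel > mapred.reduce.tasks > estimate.
          if (mro.requestedParallelism > 0) {                  // 1. PARALLEL statement
              conf.set("mapred.reduce.tasks", "" + mro.requestedParallelism);
          } else if (pigContext.defaultParallel > 0) {         // 2. default_parallel
              conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
          } else if (userSetReduceTasks(conf)) {               // 3. -Dmapred.reduce.tasks=N (hypothetical check)
              // keep the user-supplied value as-is
          } else {
              estimateNumberOfReducers(conf, mro);             // 4. input-size estimate
          }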

          Daniel Dai added a comment -

          Agreed, "mapred.reduce.tasks" should have a higher priority than estimateNumberOfReducers(). If not, then it's a bug.

          Jeff Zhang added a comment -

          I've created a ticket for this issue: PIG-1810.


            People

            • Assignee: Jeff Zhang
            • Reporter: Arun C Murthy
            • Votes: 0
            • Watchers: 3
