Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
- None
Description
We've seen a fair number of instances where naive users process huge data-sets (>10TB) with badly mis-configured #reduces, e.g. 1 reduce.
This is a significant problem on large clusters, since each attempt of the reduce takes a long time to shuffle and then runs into problems such as running out of local disk space; the job then goes through 4 such attempts before it fails.
Proposal: Come up with heuristics/configs to fail such jobs early.
Thoughts?
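One possible shape for such a heuristic is sketched below: at job-submission time, compare the total input size against the configured number of reduces and reject the job if the per-reduce share exceeds a configurable ceiling. This is a minimal illustration only; the config key (mapreduce.job.max-bytes-per-reduce), its default, and the standalone helper class are hypothetical, not an existing Hadoop API.

import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative heuristic only: reject a job at submission time if the
 * configured number of reduces would leave each reduce with an
 * unreasonably large share of the input.
 */
public class ReduceSanityCheck {

  // Hypothetical config key and default (128 GB per reduce), for illustration.
  public static final String MAX_BYTES_PER_REDUCE =
      "mapreduce.job.max-bytes-per-reduce";
  public static final long DEFAULT_MAX_BYTES_PER_REDUCE =
      128L * 1024 * 1024 * 1024;

  /**
   * @param conf            job configuration
   * @param totalInputBytes total input size, as computed during split calculation
   * @param numReduces      configured number of reduce tasks
   * @throws IllegalArgumentException if the job looks mis-configured
   */
  public static void check(Configuration conf, long totalInputBytes, int numReduces) {
    if (numReduces <= 0) {
      // Map-only jobs are not affected by this heuristic.
      return;
    }
    long maxBytesPerReduce =
        conf.getLong(MAX_BYTES_PER_REDUCE, DEFAULT_MAX_BYTES_PER_REDUCE);
    long bytesPerReduce = totalInputBytes / numReduces;
    if (bytesPerReduce > maxBytesPerReduce) {
      throw new IllegalArgumentException(
          "Job is configured with " + numReduces + " reduce(s) for "
          + totalInputBytes + " bytes of input (~" + bytesPerReduce
          + " bytes per reduce); the limit is " + maxBytesPerReduce
          + " bytes. Increase the number of reduces or override "
          + MAX_BYTES_PER_REDUCE + ".");
    }
  }
}

Failing at submission in this way gives feedback immediately, rather than after several long-running reduce attempts; the threshold would need to be overridable for jobs that genuinely require a single reduce.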
Attachments
Issue Links
- is related to: PIG-1249 Safe-guards against misconfigured Pig scripts without PARALLEL keyword (Closed)