[TEZ-3159] Reduce memory utilization while serializing keys and values - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Currently, DataOutputBuffer is used for serialization. Its underlying buffer doubles in size each time it reaches capacity. In some Pig scripts that serialize big bags, we end up with an OOM in Tez because there is not enough space to double the array. MapReduce mode runs fine in those cases with a 1 GB heap. The scenarios are:

- When the combiner runs in the reducer and some of the fields after combining are still big bags (e.g. distinct). Currently, with MapReduce, the combiner does not run in the reducer (MAPREDUCE-5221). Since the input sort buffers hold a good amount of memory at that time, it can easily go OOM.
- While serializing output with bags when there are multiple inputs and outputs, and the sort buffers for those take up space.

This is especially painful once the buffer size reaches 128 MB. Doubling at 128 MB requires 128 MB (existing array) + 256 MB (new array), i.e. 384 MB live at the same time. Any doubling after that requires even more space. But most of the time the data is probably not going to fill that 256 MB, so much of the new array is wasted.
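
To make the arithmetic above concrete, here is a minimal sketch of the cost of one doubling step. It does not use DataOutputBuffer itself; the class and method names (DoublingCost, peakBytesDuringDoubling, wastedBytesAfterDoubling) are hypothetical and only model a buffer that grows by allocating a new array of twice the size and copying the old one into it.

```java
public class DoublingCost {
    static final long MB = 1024L * 1024L;

    // Heap needed at the moment of growth: the old array and the new,
    // twice-as-large array are both live while the contents are copied.
    static long peakBytesDuringDoubling(long currentCapacity) {
        return currentCapacity + 2 * currentCapacity;
    }

    // Capacity left unused after one doubling if only `payload` bytes
    // are ever written (assumes payload > currentCapacity, which is
    // what triggered the growth in the first place).
    static long wastedBytesAfterDoubling(long currentCapacity, long payload) {
        long newCapacity = 2 * currentCapacity;
        return Math.max(0L, newCapacity - payload);
    }

    public static void main(String[] args) {
        // Print the transient peak for a few capacities around 128 MB.
        for (long cap = 64 * MB; cap <= 512 * MB; cap *= 2) {
            System.out.printf("capacity %d MB -> %d MB, transient peak %d MB%n",
                    cap / MB, 2 * cap / MB, peakBytesDuringDoubling(cap) / MB);
        }
        // Hypothetical payload of 130 MB: just over the 128 MB capacity.
        System.out.printf("wasted after doubling for a 130 MB payload: %d MB%n",
                wastedBytesAfterDoubling(128 * MB, 130 * MB) / MB);
    }
}
```

Under this model, a payload only slightly larger than 128 MB still forces the full 256 MB allocation, which is the wastage the description refers to.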

Attachments

TEZ-3159.005.patch, 15/Aug/17 19:48, 56 kB, Muhammad Samir Khan
TEZ-3159.004.patch, 15/Aug/17 17:31, 47 kB, Muhammad Samir Khan
TEZ-3159.003.patch, 11/Aug/17 17:59, 28 kB, Muhammad Samir Khan
TEZ-3159.002.patch, 09/Aug/17 20:28, 23 kB, Muhammad Samir Khan
TEZ-3159.001.patch, 09/Aug/17 17:38, 22 kB, Muhammad Samir Khan

People

Assignee: Muhammad Samir Khan

Reporter: Rohini Palaniswamy

Votes: 0

Watchers: 8

Dates

Created: 09/Mar/16 20:02

Updated: 15/Aug/17 19:48