[YARN-7147] ATS1.5 crash due to OOM - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: timelineserver
Labels: None

Description

In a production cluster, the ATS server goes down with OOM even though app-cache-size is set to a minimal value, i.e. less than 5. The entity-group-fs-store.cache-store-class is configured with MemoryTimelineStore, which is the default. The heap size configured for the ATS daemon is 8GB.

This happens because ATS parses the entity log file per domain and caches it. If a domain has a lot of entity information, the in-memory cache store loads all of it, which causes the OOM. After a restart, ATS caches the same domain again and goes OOM again.

There are two possible ways to handle it:

1. Threshold the number of entities loaded into the in-memory cache. This can still lead to OOM if the data size is huge.
2. Threshold based on the data size in the store.
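The two thresholds above can be sketched roughly as follows. This is a minimal illustration, not the timeline server's code: the class `BoundedEntityCache`, its methods, and both limits are hypothetical names introduced here, and the serialized UTF-8 length is only a rough stand-in for the heap an entity really occupies.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the YARN timeline server.
class BoundedEntityCache {
    private final int maxEntities;  // option 1: cap on the number of entities
    private final long maxBytes;    // option 2: cap on serialized data size
    private final List<String> entities = new ArrayList<>();
    private long bytesLoaded = 0;
    private boolean truncated = false;

    public BoundedEntityCache(int maxEntities, long maxBytes) {
        if (maxEntities <= 0 || maxBytes <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxEntities = maxEntities;
        this.maxBytes = maxBytes;
    }

    /**
     * Tries to cache one serialized entity. Returns false, and marks the
     * cache as truncated, when either limit would be exceeded.
     */
    public boolean offer(String serializedEntity) {
        // Serialized size is an approximation; real heap usage is larger.
        long size = serializedEntity.getBytes(StandardCharsets.UTF_8).length;
        if (entities.size() >= maxEntities || bytesLoaded + size > maxBytes) {
            truncated = true;
            return false;
        }
        entities.add(serializedEntity);
        bytesLoaded += size;
        return true;
    }

    public int size() { return entities.size(); }

    public long bytesLoaded() { return bytesLoaded; }

    public boolean isTruncated() { return truncated; }
}
```

A loader that gets `false` back from `offer` could stop reading the entity log and either serve partial results or hand the rest to a disk-backed store. A count limit alone does not bound memory when individual entities are large, which is why the sketch also tracks a byte budget.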

We hit the first case, where the number of entities is very large.

Attachments

- Screen Shot - suspect-2.png, 01/Sep/17 13:40, 460 kB, Rohith Sharma K S
- Screen Shot - suspect-1.png, 01/Sep/17 13:39, 474 kB, Rohith Sharma K S

Issue Links

is duplicated by

YARN-4219 New levelDB cache storage for timeline v1.5 (Resolved)

People

Assignee: Rohith Sharma K S

Reporter: Rohith Sharma K S

Votes: 0

Watchers: 4

Dates

Created: 01/Sep/17 13:36

Updated: 01/Sep/17 15:25

Resolved: 01/Sep/17 15:19