[FLINK-11205] Task Manager Metaspace Memory Leak - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 1.5.5, 1.6.2, 1.7.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None
Flags: Important

Description

Job restarts cause the task manager to dynamically load duplicate classes. Metaspace is unbounded and grows with every restart. YARN aggressively kills such containers, but the effect is immediately seen on a different task manager, which results in a death spiral.

The task manager uses dynamic class loading, as described in https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html:

YARN

YARN classloading differs between single job deployments and sessions:

When submitting a Flink job/application directly to YARN (via bin/flink run -m yarn-cluster ...), dedicated TaskManagers and JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in the Java classpath. That means that there is no dynamic classloading involved in that case.

When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the classpath. The classes from all jobs that are submitted against the session are loaded dynamically.

The above is not entirely true, especially when you set -yD classloader.resolve-order=parent-first. We also observed the behaviour described above when submitting a Flink job/application directly to YARN (via bin/flink run -m yarn-cluster ...).
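
One way to confirm the growth described above is to sample the JVM's class-loading counters and Metaspace usage before and after each job restart. The following is a minimal sketch using only the standard java.lang.management API; it is not part of Flink. The class name MetaspaceProbe is hypothetical, and the pool name "Metaspace" is an assumption that holds for HotSpot JVMs (other JVMs may name the pool differently, in which case the probe returns -1).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper for sampling class-loading and Metaspace statistics. */
public class MetaspaceProbe {

    /**
     * Returns the bytes currently used by the memory pool named "Metaspace",
     * or -1 if the running JVM does not expose a pool with that name.
     */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot reports the pool under exactly this name.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Number of classes unloaded since the JVM started. */
    static long unloadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }

    public static void main(String[] args) {
        System.out.printf(
                "loaded=%d unloaded=%d metaspaceUsedBytes=%d%n",
                loadedClassCount(), unloadedClassCount(), metaspaceUsedBytes());
    }
}
```

If the loaded-class count and Metaspace usage keep rising across restarts while the unloaded-class count stays flat, that is consistent with user code class loaders not being released, which is the symptom this issue reports.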

Attachments

Screenshot 2018-12-18 at 12.14.11.png (176 kB), 19/Dec/18 15:26, Nawaid Shamim
Screenshot 2018-12-18 at 15.47.55.png (166 kB), 19/Dec/18 15:47, Nawaid Shamim

Issue Links

duplicates
FLINK-16408 Bind user code class loader to lifetime of a slot (Resolved)

is related to
FLINK-9080 Flink Scheduler goes OOM, suspecting a memory leak (Closed)
FLINK-10317 Configure Metaspace size by default (Closed)

relates to
FLINK-16142 Memory Leak causes Metaspace OOM error on repeated job submission (Closed)

People

Assignee: Unassigned
Reporter: Nawaid Shamim
Votes: 1
Watchers: 25

Dates

Created: 19/Dec/18 15:29
Updated: 18/May/20 14:15
Resolved: 18/May/20 14:15