[FLINK-16225] Metaspace Out Of Memory should be handled as Fatal Error in TaskManager - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.10.0
Fix Version/s: 1.10.2, 1.11.0
Component/s: Runtime / Task
Labels:
- pull-request-available
- usability

Description

When an OutOfMemory (Metaspace) exception happens, there is usually no way to recover. This is often the result of user code or libraries that have subtle class loading leaks.

The one way to recover is to kill the TaskManagers and to let the resource orchestrators (K8s, Yarn, Mesos) restart them. Flink's fault tolerance should then be able to recover the job.

I would suggest to implement this the following way:

The user code ClassLoader takes an "OOM Handler", which is called when class loading causes an OOM exception.
The handler wraps this into an Exception with a good error message (see below) and invokes the TaskManager's FatalErrorHandler.
The FatalErrorHandler in turn should attempt to cancel everything and notify the JM before shutting down. That way, we get decent error reporting and users can see what is going on.

The error message should describe the following:

If user sees the error consistently on the first deploy, then the metaspace is simply too small for their application, and they need to explicitly increase it in the configuration
If the user sees occasionally TaskManagers in a session cluster failing with that exception when deploying new jobs, then some user code or library probably has a class leak. The TM failure / restart is done in order to forcefully clean up.

Attachments

Issue Links

causes

FLINK-17514 TaskCancelerWatchdog does not kill TaskManager

Resolved

is blocked by

FLINK-16245 Use a delegating classloader as the user code classloader to prevent class leaks.

Closed

supercedes

FLINK-16142 Memory Leak causes Metaspace OOM error on repeated job submission

Closed

links to

Github PR #11408 (Only improves error message)

GitHub Pull Request #12446

GitHub Pull Request #12563

(1 links to)

Activity

People

Assignee:: Andrey Zagrebin

Reporter:: Stephan Ewen

Votes:: 0 Vote for this issue

Watchers:: 20 Start watching this issue

Dates

Created:: 21/Feb/20 17:17

Updated:: 22/Jul/20 12:08

Resolved:: 15/Jun/20 14:32