[FLINK-9132] Cluster runs out of task slots when a job falls into restart loop - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Won't Fix
Affects Version/s: 1.4.2
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None
Environment:


env.java.opts in flink-conf.yaml file:

env.java.opts: -Xloggc:/home/user/flink/log/flinkServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=55 -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=2 -XX:-ResizePLAB -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=100M


Description

If a job is restarting in a loop, the Task Manager hosting it goes down after some time. The Job Manager automatically reassigns the job to another Task Manager, which then goes down as well. Eventually all Task Managers are gone and the cluster is paralyzed.

I attached jconsole to the TaskManager's Java process and noticed that the number of loaded classes increases dramatically when a job is in a restart loop and restores from a checkpoint.
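The loaded-class count that jconsole plots can also be read programmatically through the standard `java.lang.management` API. The following is a minimal sketch, not part of Flink or of the attached reproducer: the class name `LoadedClassProbe` and its methods are hypothetical, and the probe only samples the JVM it runs in. Watching a TaskManager this way would require running equivalent code inside that process or connecting to it over JMX, which is not shown here.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

/**
 * Hypothetical helper (not part of Flink or of the attached FailedJob.java).
 * Samples the class-loading counters of the JVM it runs in.
 */
public class LoadedClassProbe {

    /** Returns {currently loaded, total loaded since JVM start, total unloaded}. */
    static long[] snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        long loaded = bean.getLoadedClassCount();
        long total = bean.getTotalLoadedClassCount();
        long unloaded = bean.getUnloadedClassCount();
        return new long[] {loaded, total, unloaded};
    }

    /** Formats one snapshot as a single log-friendly line. */
    static String format(long[] s) {
        return "loaded=" + s[0] + " totalLoaded=" + s[1] + " unloaded=" + s[2];
    }

    public static void main(String[] args) throws InterruptedException {
        // Optional arguments: number of samples, interval in milliseconds.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        long intervalMs = args.length > 1 ? Long.parseLong(args[1]) : 1000L;
        for (int i = 0; i < samples; i++) {
            System.out.println(format(snapshot()));
            if (i < samples - 1) {
                Thread.sleep(intervalMs);
            }
        }
    }
}
```

A "loaded" value that keeps climbing while "unloaded" stays flat would match the pattern described above, where classes loaded on each restart are never released.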

See the attachment for the graph with G1GC enabled on the node. The standard GC performs even worse: the Task Manager shuts down within 20 minutes of the restart loop starting.

I've also attached a minimal program to reproduce the problem.

Please let me know if any additional information is required from me.

Attachments

- FailedJob.java (2 kB, uploaded 04/Apr/18 14:40 by Alex Smirnov)
- jconsole-classes.png (82 kB, uploaded 04/Apr/18 14:32 by Alex Smirnov)

People

Assignee: Unassigned

Reporter: Alex Smirnov

Votes: 0

Watchers: 7

Dates

Created: 04/Apr/18 14:56

Updated: 29/Mar/19 11:45

Resolved: 29/Mar/19 11:45