Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.12.0
Description
A user reported that cancelling a job can lead to an uncaught exception that kills the JobMaster. The problem seems to be that the CheckpointsCleaner can trigger CheckpointCoordinator actions after the job has reached a terminal state and has therefore been shut down. Apparently, we do not properly manage the lifecycles of the CheckpointCoordinator and the checkpoint post-cleanup actions.
The uncaught exception is:
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@41554407 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@5d0ec6f7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 25977]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
    at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622)
    at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
    at org.apache.flink.runtime.concurrent.ScheduledExecutorServiceAdapter.execute(ScheduledExecutorServiceAdapter.java:62)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.scheduleTriggerRequest(CheckpointCoordinator.java:1152)
    at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:58)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
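The failure mode described above can be reproduced outside Flink with plain java.util.concurrent classes. The following is a minimal sketch, not Flink's code and not the fix that was applied for this issue: the Coordinator class, its method names, and the shutdown flag are hypothetical stand-ins, assuming only what the stack trace shows (a post-cleanup action running on one executor calls execute on a second executor that has already terminated). The guarded variant illustrates one possible way to tolerate a late callback.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanupLifecycleSketch {

    /** Hypothetical stand-in for a coordinator that owns a timer executor. */
    static final class Coordinator {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        private final AtomicBoolean shutdown = new AtomicBoolean(false);

        /** Unguarded: throws RejectedExecutionException once the timer is shut down. */
        void scheduleTriggerRequestUnguarded() {
            timer.execute(() -> {});
        }

        /** Guarded: returns false instead of throwing when called after shutdown. */
        boolean scheduleTriggerRequestGuarded() {
            if (shutdown.get()) {
                return false;
            }
            try {
                timer.execute(() -> {});
                return true;
            } catch (RejectedExecutionException e) {
                // Shutdown raced with this call; treat it as a no-op.
                return false;
            }
        }

        void shutdown() {
            shutdown.set(true);
            timer.shutdownNow();
        }
    }

    /**
     * Shuts the coordinator down, then runs a post-cleanup action on a separate
     * I/O thread. Returns the throwable that escaped the action, or null if none.
     * submit() is used here so the exception can be captured; with execute()
     * it would reach the thread's uncaught exception handler instead.
     */
    static Throwable runPostCleanupAfterShutdown(boolean guarded) {
        Coordinator coordinator = new Coordinator();
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            coordinator.shutdown();
            Future<?> result = ioExecutor.submit(() -> {
                if (guarded) {
                    coordinator.scheduleTriggerRequestGuarded();
                } else {
                    coordinator.scheduleTriggerRequestUnguarded();
                }
            });
            try {
                result.get(5, TimeUnit.SECONDS);
                return null;
            } catch (ExecutionException e) {
                return e.getCause();
            } catch (InterruptedException | TimeoutException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + runPostCleanupAfterShutdown(false));
        System.out.println("guarded:   " + runPostCleanupAfterShutdown(true));
    }
}
```

In this sketch the unguarded call surfaces a RejectedExecutionException from the terminated timer, while the guarded call returns without throwing. Whether dropping the late request is acceptable depends on the component; it is shown here only to make the lifecycle mismatch concrete.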
Issue Links
- is caused by
  - FLINK-17073 Slow checkpoint cleanup causing OOMs (Closed)
- is duplicated by
  - FLINK-20993 Cleaning up checkpoint during shutdown may fail JM (Closed)
  - FLINK-23874 JM did not store latest checkpoint id into Zookeeper, silently (Closed)
- is related to
  - FLINK-21053 Prevent potential RejectedExecutionExceptions in CheckpointCoordinator failing JM (Open)
- links to