Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- None
Description
In the past, multiple bugs were caused by throwing or handling RejectedExecutionException in CheckpointCoordinator (FLINK-18290, FLINK-20992).
I think further bugs of this kind are still possible, because in many places an executor is passed to CompletableFuture.xxxAsync calls even though it may already be shut down.
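For illustration, a minimal standalone sketch of the failure mode using only the JDK (this is not Flink code, and the class name RejectedAfterShutdownDemo is made up for the example). It assumes an executor with the JDK's default abort policy, in which case CompletableFuture.runAsync throws synchronously once the executor has been shut down:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical demo class; not part of Flink. */
final class RejectedAfterShutdownDemo {

    /** Returns true if submitting to an already shut-down executor is rejected. */
    static boolean isRejectedAfterShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.shutdown();
        try {
            // The executor uses the default AbortPolicy, so execute() throws here.
            CompletableFuture.runAsync(() -> {}, executor);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + isRejectedAfterShutdown());
    }
}
```

Any caller that does not expect this exception on the submitting thread can fail in the way the linked issues describe.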
In FLINK-20992 we discussed two approaches to fix this.
The first approach is to check the executor state inside a synchronized block every time the executor is used.
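A rough sketch of what such a guarded use could look like, assuming a hypothetical wrapper (GuardedExecutor and tryRunAsync are illustrative names, not existing Flink classes or methods):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical wrapper; illustrates the "check under lock" idea only. */
final class GuardedExecutor {
    private final Object lock = new Object();
    private final ExecutorService executor;

    GuardedExecutor(ExecutorService executor) {
        this.executor = executor;
    }

    /** Submits the action, or returns false if the executor is already shut down. */
    boolean tryRunAsync(Runnable action) {
        synchronized (lock) {
            if (executor.isShutdown()) {
                return false;
            }
            CompletableFuture.runAsync(action, executor);
            return true;
        }
    }

    /** Shutdown takes the same lock, so it cannot interleave with a submission. */
    void shutdown() {
        synchronized (lock) {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        GuardedExecutor guarded = new GuardedExecutor(Executors.newSingleThreadExecutor());
        System.out.println(guarded.tryRunAsync(() -> {})); // submitted while running
        guarded.shutdown();
        System.out.println(guarded.tryRunAsync(() -> {})); // skipped after shutdown
    }
}
```

Note that the check is only race-free if shutdown goes through the same lock. If the executor is passed in from outside and shut down elsewhere, it can still be shut down between the check and the submission.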
The second approach is to:
- Create the executors (both the io and timer thread pools) inside CheckpointCoordinator
- Check isShutdown() in their RejectedExecution handlers (if the executor is shut down and the exception is a RejectedExecutionException, just log it; otherwise delegate to FatalExitExceptionHandler)
- (this would allow removing such RejectedExecutionException checks from the coordinator code)
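One possible shape of this second approach, as a sketch under assumptions: the handler is modeled as a JDK RejectedExecutionHandler, the class ShutdownAwareRejectionHandler and its helper newOwnedExecutor are hypothetical, and a plain Thread.UncaughtExceptionHandler stands in for Flink's FatalExitExceptionHandler so that the example stays self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical handler: tolerate rejections after shutdown, escalate all others. */
final class ShutdownAwareRejectionHandler implements RejectedExecutionHandler {
    // Stand-in for FatalExitExceptionHandler (assumption for this sketch).
    private final Thread.UncaughtExceptionHandler fatalHandler;
    private final AtomicInteger ignored = new AtomicInteger();

    ShutdownAwareRejectionHandler(Thread.UncaughtExceptionHandler fatalHandler) {
        this.fatalHandler = fatalHandler;
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        if (executor.isShutdown()) {
            // Expected while shutting down: log and drop the task.
            ignored.incrementAndGet();
            System.err.println("Ignoring task rejected after shutdown: " + task);
        } else {
            // Rejection from a running executor is unexpected: treat as fatal.
            fatalHandler.uncaughtException(
                    Thread.currentThread(),
                    new RejectedExecutionException("Task " + task + " rejected from " + executor));
        }
    }

    int ignoredCount() {
        return ignored.get();
    }

    /** Creates a single-thread pool owned by the caller and wired to the handler. */
    static ThreadPoolExecutor newOwnedExecutor(ShutdownAwareRejectionHandler handler) {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(), handler);
    }

    public static void main(String[] args) {
        ShutdownAwareRejectionHandler handler = new ShutdownAwareRejectionHandler(
                (thread, error) -> System.err.println("FATAL: " + error));
        ThreadPoolExecutor executor = newOwnedExecutor(handler);
        executor.shutdown();
        CompletableFuture.runAsync(() -> {}, executor); // rejected and logged, not thrown
        System.out.println("ignored rejections: " + handler.ignoredCount());
    }
}
```

One consequence worth keeping in mind: because the handler returns normally after shutdown, the future returned by runAsync in that case never completes, so callers must not block on it.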
Issue Links
- relates to FLINK-20992 Checkpoint cleanup can kill JobMaster (Closed)