Details
- Improvement
- Status: Open
- Not a Priority
- Resolution: Unresolved
- 1.11.0
- None
Description
With a high degree of parallelism, we end up with n*s files in each checkpoint (n = parallelism, s = stages). Writing them is fast (done in parallel by many subtasks), but removing them is slow (done by the JM alone).
This can't be mitigated by state.backend.fs.memory-threshold because most states are tens to hundreds of MB.
Instead of deleting the files one by one, we could remove the checkpoint directory recursively.
The easiest way is to remove channelStateHandle.discard() calls and use isRecursive=true in FsCompletedCheckpointStorageLocation.disposeStorageLocation.
Note: with the current isRecursive=false there will be an exception if there are any files left under that folder.
This can be extended to other state handles in the future as well.
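The difference between the two isRecursive settings can be sketched on a local filesystem. This is only an illustration under stated assumptions: CheckpointDirCleanup and its methods are hypothetical names, not Flink code, and java.nio.file stands in for Flink's FileSystem#delete(path, recursive). The local stand-in still walks the tree itself; the benefit described above comes from filesystems that can drop a whole directory in a single call from the JM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink: a local-filesystem stand-in
// for the semantics of delete(path, recursive).
final class CheckpointDirCleanup {

    /** Creates a fake checkpoint directory holding n * s small files. */
    static Path createFakeCheckpoint(int parallelism, int stages) throws IOException {
        Path dir = Files.createTempDirectory("chk-");
        for (int i = 0; i < parallelism * stages; i++) {
            Files.write(dir.resolve("channel-state-" + i), new byte[] {1});
        }
        return dir;
    }

    /** Mimics recursive=false: fails with an IOException if any file is left. */
    static void deleteNonRecursive(Path dir) throws IOException {
        Files.delete(dir);
    }

    /** Mimics recursive=true: removes the directory and everything under it. */
    static void deleteRecursive(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

With recursive deletion, the per-handle discard() calls become unnecessary for files under the checkpoint directory, which is the change proposed above.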
Issue Links
- duplicates
  - FLINK-13856 Reduce the delete file api when the checkpoint is completed (Open)
- relates to
  - FLINK-17073 Slow checkpoint cleanup causing OOMs (Closed)
  - FLINK-22682 Checkpoint interval too large for higher DOP (Closed)