[FLINK-17860] Recursively remove channel state directories - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.11.0
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:

Description

With a high degree of parallelism, we end up with n*s files in each checkpoint (n = parallelism, s = stages). Writing them is fast (done from many subtasks in parallel), but removing them is slow (done from the JM alone).

This can't be mitigated by state.backend.fs.memory-threshold because most states are tens to hundreds of MB.

Instead of deleting the files one by one, we could remove the directory recursively.

The easiest way is to remove the channelStateHandle.discard() calls and pass isRecursive=true in FsCompletedCheckpointStorageLocation.disposeStorageLocation.

Note: with the current isRecursive=false there will be an exception if there are any files left under that folder.
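
The difference between the two modes can be sketched with plain java.nio.file as a local-filesystem analogue. This is not Flink's FileSystem API: the class and method names below (CheckpointDirCleanup, deleteNonRecursive, deleteRecursive) and the file names are hypothetical and only for illustration. The sketch shows that a non-recursive delete of a directory that still contains files fails, while a recursive delete removes everything without tracking individual handles.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Flink.
public class CheckpointDirCleanup {

    // Analogue of isRecursive=false: succeeds only if the directory is empty.
    static void deleteNonRecursive(Path dir) throws IOException {
        Files.delete(dir);
    }

    // Analogue of isRecursive=true: removes the directory and everything under it.
    static void deleteRecursive(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a checkpoint directory holding leftover channel state files.
        Path dir = Files.createTempDirectory("chk-demo");
        for (int i = 0; i < 3; i++) {
            Files.createFile(dir.resolve("channel-state-" + i));
        }
        try {
            deleteNonRecursive(dir);
        } catch (DirectoryNotEmptyException e) {
            System.out.println("non-recursive delete failed: directory not empty");
        }
        deleteRecursive(dir);
        System.out.println("directory exists after recursive delete: " + Files.exists(dir));
    }
}
```

On a local filesystem the recursive variant still visits every file, so this sketch only demonstrates the semantics. The expected speedup in this ticket would come from distributed filesystems where a recursive delete is, as far as I know, a single request to the filesystem rather than one call per file from the JM.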

This can be extended to other state handles in the future as well.

Issue Links

duplicates

FLINK-13856 Reduce the delete file api when the checkpoint is completed (Open)

relates to

FLINK-17073 Slow checkpoint cleanup causing OOMs (Closed)

FLINK-22682 Checkpoint interval too large for higher DOP (Closed)

People

Assignee: Unassigned

Reporter: Roman Khachatryan

Votes: 0

Watchers: 9

Dates

Created: 21/May/20 08:03

Updated: 12/Feb/22 10:37