Flink / FLINK-35853

Regression in checkpoint size when performing full checkpointing in RocksDB


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18.1
    • Fix Version/s: None
    • Component/s: None
    • Environment: amazon-linux-2023

    Description

  We have a job with a small and static state size (existing state entries are updated rather than new ones added). The job is configured to use RocksDB with full checkpointing (incremental checkpointing disabled), because the diff between checkpoints is larger than the full checkpoint size.
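  For reference, this setup can be expressed roughly as the following configuration sketch (key names as documented for Flink 1.18; `state.backend.type` replaced the older `state.backend` key in that release):

  ```yaml
  # Use the RocksDB (EmbeddedRocksDBStateBackend) keyed-state backend
  state.backend.type: rocksdb
  # Take full snapshots instead of incremental ones
  state.backend.incremental: false
  ```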

  After migrating to 1.18, we observed a significant and steady increase in full checkpoint size with RocksDB + full checkpointing. The increase was not observed with the hashmap state backend.

  I managed to reproduce the issue with the following code:

      StaticStateSizeGenerator115.java
      StaticStateSizeGenerator118.java

      Result:

      • On Flink 1.15, RocksDB + full checkpointing: checkpoint size is constant at 250 KiB.
      • On Flink 1.18, RocksDB + full checkpointing: checkpoint size got up to 38 MiB before dropping (presumably due to compaction?).
      • On Flink 1.18, hashmap state backend: checkpoint size is constant at 219 KiB.

      Notes:

      One observation is that the issue is more pronounced at higher parallelism; the reproduction code uses a parallelism of 8. The production application in which we first saw the regression reached checkpoint sizes in the GiB range, where we expected, and observed on 1.15, at most a couple of MiB.

      Attachments

        1. StaticStateSizeGenerator115.java
          8 kB
          Keith Lee
        2. StaticStateSizeGenerator118.java
          7 kB
          Keith Lee

        Activity

          People

            Assignee: Unassigned
            Reporter: Keith Lee (leekeiabstraction)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: