[FLINK-9352] In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.2, 1.5.0, 1.6.0
Fix Version/s: 1.6.0
Component/s: Runtime / State Backends
Labels:
- pull-request-available

Description

Currently, the CheckpointCoordinator's startCheckpointScheduler passes baseInterval as the initialDelay parameter when it schedules periodic checkpoints. baseInterval is also the checkpoint interval, so the first checkpoint always fires exactly one interval after the scheduler starts.

In standalone checkpoint recovery mode, many jobs are configured with the same checkpoint interval. When all jobs are recovered at once (after a cluster restart or a JobManager leadership change), their checkpoint schedules converge: every job's CheckpointCoordinator starts and triggers checkpoints at approximately the same point in time.

In our production environment, this caused high IO load concentrated in the same time window.

I suggest exposing the initial delay passed to scheduleAtFixedRate as an API configuration option, so that users can scatter checkpoints in this scenario.
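A minimal sketch of the idea, using only the JDK's ScheduledExecutorService rather than Flink's actual classes. The class name ScatteredCheckpointScheduler and the methods start and randomInitialDelay are hypothetical and exist only for illustration. The caller can either pass an explicit initial delay or draw a random one in [0, baseInterval), so that jobs recovered at the same moment do not all trigger together.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not part of Flink's API.
final class ScatteredCheckpointScheduler {

    /** Returns a random delay in [0, baseIntervalMillis). */
    static long randomInitialDelay(long baseIntervalMillis) {
        if (baseIntervalMillis <= 0) {
            throw new IllegalArgumentException("baseIntervalMillis must be positive");
        }
        return ThreadLocalRandom.current().nextLong(baseIntervalMillis);
    }

    /**
     * Schedules the trigger periodically. The initial delay is decoupled
     * from the period instead of being hard-wired to baseIntervalMillis.
     */
    static ScheduledFuture<?> start(
            ScheduledExecutorService timer,
            Runnable trigger,
            long initialDelayMillis,
            long baseIntervalMillis) {
        return timer.scheduleAtFixedRate(
                trigger, initialDelayMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long baseIntervalMillis = 200L;

        // Each "job" draws its own offset, so the triggers are spread out.
        for (int job = 0; job < 3; job++) {
            final int id = job;
            long delay = randomInitialDelay(baseIntervalMillis);
            start(timer, () -> System.out.println("trigger checkpoint for job " + id),
                    delay, baseIntervalMillis);
        }

        Thread.sleep(1000L);
        timer.shutdownNow();
    }
}
```

With a fixed initial delay equal to the interval, all three jobs in main would fire at the same instants; with per-job offsets, their triggers are distributed across the interval.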

cc StephanEwen Zentol

Issue Links

links to

GitHub Pull Request #6092

People

Assignee: vinoyang

Reporter: vinoyang

Votes: 0

Watchers: 6

Dates

Created: 14/May/18 07:57

Updated: 02/Oct/19 17:49

Resolved: 07/Jul/18 09:11