[FLINK-21883] Introduce cooldown period into adaptive scheduler - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Not a Priority
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.19.0
Component/s: Runtime / Coordination
Labels:

Description

This is a follow-up to reactive mode, introduced in FLINK-10407.

Introduce a cooldown timeout: after a scaling action, no further scaling actions are performed until the timeout has elapsed.
Without such a cooldown timeout, unlucky timing can cause the job to be rescaled very frequently, because TaskManagers do not all connect at the same time.
With the current implementation (1.13), this only applies to scaling up, but it can also apply to scaling down with autoscaling support.

With this implemented, users can define a cooldown timeout of, say, 5 minutes: if TaskManagers then connect slowly one after another, we will rescale at most once every 5 minutes.
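
The effect can be illustrated with a small simulation. This is only a sketch of the idea, not the scheduler's actual implementation: `CooldownGate`, `simulate`, and the arrival times are hypothetical names and values made up for illustration, and time is modeled as plain seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CooldownGate:
    """Hypothetical helper: permits a rescale only once the cooldown has elapsed."""

    cooldown_s: float
    last_rescale_s: Optional[float] = None

    def try_rescale(self, now_s: float) -> bool:
        # Reject the request if the previous rescale was less than cooldown_s ago.
        if (
            self.last_rescale_s is not None
            and now_s - self.last_rescale_s < self.cooldown_s
        ):
            return False
        self.last_rescale_s = now_s
        return True


def simulate(arrivals_s: List[float], cooldown_s: float) -> List[float]:
    """Return the arrival times at which a rescale would actually be triggered."""
    gate = CooldownGate(cooldown_s)
    return [t for t in sorted(arrivals_s) if gate.try_rescale(t)]


# Assumed scenario: one TaskManager connects every 60 seconds.
arrivals = [0, 60, 120, 180, 240, 300, 360]

no_cooldown = simulate(arrivals, cooldown_s=0)      # rescales on every arrival
with_cooldown = simulate(arrivals, cooldown_s=300)  # 5-minute cooldown

print(len(no_cooldown), "rescales without cooldown")
print(with_cooldown, "rescale times with a 300 s cooldown")
```

In this scenario the job is rescaled seven times without a cooldown and only twice (at 0 s and 300 s) with a 5-minute cooldown. The sketch simply ignores arrivals that fall inside the cooldown window; a real scheduler would also need to pick up those resources once the cooldown expires.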

Issue Links

causes

FLINK-34272 AdaptiveSchedulerClusterITCase failure due to MiniCluster not running (Resolved)
FLINK-33976 AdaptiveScheduler cooldown period is taken from a wrong configuration (Resolved)

is duplicated by

FLINK-32484 AdaptiveScheduler combined restart during scaling out (Closed)

is related to

FLINK-10407 FLIP-159: Reactive mode (Closed)

links to

GitHub Pull Request #22985

People

Assignee: Etienne Chauchot

Reporter: Robert Metzger

Votes: 0

Watchers: 6

Dates

Created: 19/Mar/21 18:50

Updated: 30/Jan/24 20:47

Resolved: 26/Oct/23 04:55