[FLINK-9635] Local recovery scheduling can cause spread out of tasks - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.0, 1.6.2
Fix Version/s: 1.6.3, 1.7.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Release Note:

With the improvements to Flink's scheduling, a recovery can no longer require more slots than the job used before the failure when local recovery is enabled. Consequently, we encourage users to enable the local recovery feature by setting `state.backend.local-recovery: true`.

Description

To make local recovery work, Flink's scheduling was changed so that it tries to reschedule each task to its previous location. To avoid occupying slots that hold cached state of other tasks, the strategy requests a new slot whenever the old slot, identified by the previous allocation id, is no longer present. This also applies to newly allocated slots, because the strategy does not distinguish between new and already used slots. As a result, every task can end up deployed to its own slot, for example if the SlotPool has released all slots in the meantime. The consequence can be that a job can no longer be executed after a failure because it needs more slots than before.
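
The effect described above can be illustrated with a toy model. The sketch below is not Flink code: the class and method names are hypothetical, and it assumes that tasks which shared a slot before the failure carry the same previous allocation id. Under the strict strategy, a task reuses a slot only if its previous allocation id is still in the pool; otherwise it requests a slot of its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the strict previous-location strategy (hypothetical names, not Flink classes). */
final class StrictPreviousLocationSketch {

    /** Slots used before the failure: one per distinct previous allocation id. */
    static int slotsBeforeFailure(List<String> previousAllocationIds) {
        return new HashSet<>(previousAllocationIds).size();
    }

    /**
     * Slots needed after the failure. A task reuses its previous slot only if that
     * allocation id is still in the pool; otherwise it requests a new slot for itself,
     * because the strategy cannot tell a fresh slot from one holding foreign state.
     */
    static int slotsAfterRecovery(List<String> previousAllocationIds, Set<String> slotsStillInPool) {
        Set<String> reused = new HashSet<>();
        int newlyRequested = 0;
        for (String previous : previousAllocationIds) {
            if (slotsStillInPool.contains(previous)) {
                reused.add(previous);   // previous slot still present: go back to it
            } else {
                newlyRequested++;       // previous slot gone: one new slot per task
            }
        }
        return reused.size() + newlyRequested;
    }

    public static void main(String[] args) {
        // Four tasks that previously shared two slots ("a" and "b").
        List<String> previous = Arrays.asList("a", "a", "b", "b");
        Set<String> bothPresent = new HashSet<>(Arrays.asList("a", "b"));

        System.out.println(slotsBeforeFailure(previous));
        System.out.println(slotsAfterRecovery(previous, bothPresent));
        System.out.println(slotsAfterRecovery(previous, Collections.<String>emptySet()));
    }
}
```

With four tasks that previously shared two slots, this prints 2, 2 and 4: as long as both slots are still in the pool the job fits into two slots again, but once the pool has released them every task asks for its own slot and the job needs four.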

Issue Links

is duplicated by

FLINK-9583 Wrong number of TaskManagers' slots after recovery.

Closed

FLINK-12245 Transient slot allocation failure on job recovery

Closed

relates to

FLINK-9892 Disable local recovery in Jepsen tests

Resolved

FLINK-9634 Deactivate previous location based scheduling if local recovery is disabled

Closed

links to

GitHub Pull Request #6961

GitHub Pull Request #6972

People

Assignee: Stefan Richter

Reporter: Till Rohrmann

Votes: 0

Watchers: 8

Dates

Created: 21/Jun/18 08:06

Updated: 18/Apr/19 12:40

Resolved: 01/Nov/18 10:36