Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.5.0, 1.6.2
Description
To make local recovery work, Flink's scheduling was changed so that a task is preferably rescheduled to its previous location. To avoid occupying slots that hold cached state of other tasks, the strategy requests a new slot if the old slot, identified by the previous allocation ID, is no longer present. This also applies to newly allocated slots, because the strategy does not distinguish between new and already used slots. As a result, every task can end up deployed to its own slot, for example if the SlotPool has released all slots in the meantime. The consequence can be that a job can no longer be executed after a failure because it needs more slots than before.
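The effect can be illustrated with a small, self-contained sketch. It is not Flink's scheduler code: the class `SpreadOutSketch`, the method `slotsRequested`, and the allocation IDs are hypothetical names chosen for illustration, and the model assumes that a task whose previous slot is gone always asks for a fresh slot that no other task may share.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of previous-location scheduling; not Flink's actual API.
class SpreadOutSketch {

    /**
     * Counts how many slots a job occupies when every task insists on the
     * slot identified by its previous allocation ID.
     */
    static int slotsRequested(List<String> previousAllocationIds, Set<String> slotsInPool) {
        Set<String> reusedSlots = new HashSet<>();
        int freshSlots = 0;
        for (String previous : previousAllocationIds) {
            if (slotsInPool.contains(previous)) {
                // The previous slot still exists: the task returns to it,
                // sharing it with the tasks that were there before.
                reusedSlots.add(previous);
            } else {
                // The previous slot is gone: request a new slot. Assumption of
                // this sketch: a fresh slot is never shared, because new and
                // already used slots are not distinguished.
                freshSlots++;
            }
        }
        return reusedSlots.size() + freshSlots;
    }

    public static void main(String[] args) {
        // Four tasks that shared two slots before the failure.
        List<String> previous = Arrays.asList("alloc-1", "alloc-1", "alloc-2", "alloc-2");

        // Both slots are still in the pool: the job fits into two slots again.
        Set<String> intactPool = new HashSet<>(Arrays.asList("alloc-1", "alloc-2"));
        System.out.println(slotsRequested(previous, intactPool)); // prints 2

        // The pool released all slots in the meantime: one slot per task.
        Set<String> emptyPool = new HashSet<>();
        System.out.println(slotsRequested(previous, emptyPool)); // prints 4
    }
}
```

In this model, a job that ran in two slots before the failure needs four afterwards, which matches the situation described above where recovery fails because the job requires more slots than before.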
Issue Links
- is duplicated by
  - FLINK-9583 Wrong number of TaskManagers' slots after recovery. (Closed)
  - FLINK-12245 Transient slot allocation failure on job recovery (Closed)
- relates to
  - FLINK-9892 Disable local recovery in Jepsen tests (Resolved)
  - FLINK-9634 Deactivate previous location based scheduling if local recovery is disabled (Closed)