[FLINK-19142] Local recovery can be broken if slot hijacking happened during a full restart - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.0
Fix Version/s: 1.14.3, 1.15.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

The ticket originates from this PR discussion.

The previous AllocationIDs are used by PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slots where they were executed before a failover. If a subtask's previous slot (AllocationID) is not available, it should not take the previous slots (AllocationIDs) of other subtasks.

The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from SlotSharingExecutionSlotAllocator, but only for the current bulk. The previous AllocationIDs of other bulks stay unknown. Therefore, the current bulk can potentially hijack the previous slots of the preceding bulks. On the other hand, the previous AllocationIDs of other tasks should be taken if those tasks are not going to run at the same time, e.g. because there are not enough resources after the failover or because the other bulks are done.

Local recovery can be broken because of this, e.g. when multiple regions of a streaming job are restarted at the same time (due to a global failover, or a task failover with the `full` failover strategy).
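
The selection behavior asked for in this description can be sketched roughly as below. This is a toy model, not Flink's actual code: the class `SlotSelectionSketch`, the method `selectSlot`, the `reservedByOthers` parameter, and the use of plain strings in place of AllocationIDs are all assumptions made for illustration. The idea is that a subtask first tries its own previous slot and otherwise skips slots that other, concurrently restarting subtasks ran in before the failover.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of previous-allocation slot selection; hypothetical, not a Flink class. */
final class SlotSelectionSketch {

    /**
     * Picks a slot for one subtask (or one slot sharing group).
     *
     * @param availableSlots   IDs of currently free slots (strings stand in for AllocationIDs)
     * @param ownPrevious      IDs of the slots this subtask used before the failover
     * @param reservedByOthers previous slot IDs of other subtasks that restart at the same time;
     *                         the caller leaves out subtasks that will not run concurrently
     */
    static Optional<String> selectSlot(
            List<String> availableSlots,
            Set<String> ownPrevious,
            Set<String> reservedByOthers) {

        // 1. Prefer the slot this subtask ran in before, so local state can be reused.
        for (String slot : availableSlots) {
            if (ownPrevious.contains(slot)) {
                return Optional.of(slot);
            }
        }

        // 2. Otherwise take any free slot that no concurrently restarting subtask is waiting for.
        for (String slot : availableSlots) {
            if (!reservedByOthers.contains(slot)) {
                return Optional.of(slot);
            }
        }

        // 3. Only reserved slots are free: report no match rather than hijack one.
        return Optional.empty();
    }
}
```

If the caller passes an empty `reservedByOthers` set, for example because the other bulks are already done, step 2 accepts any free slot. That corresponds to the case where taking another task's previous slot is acceptable.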

Issue Links

causes: FLINK-24793 DefaultSchedulerLocalRecoveryITCase fails on AZP (Closed)
is a child of: FLINK-18689 Deterministic Slot Sharing (Closed)
relates to: FLINK-16430 FLIP-119 Pipelined Region Scheduling (Closed)
links to: GitHub Pull Request #15229

People

Assignee: Zhu Zhu
Reporter: Andrey Zagrebin
Votes: 0
Watchers: 6

Dates

Created: 04/Sep/20 12:11
Updated: 15/Dec/21 01:44
Resolved: 09/Dec/21 02:46