Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.14.3, 1.15.0
Description
The DefaultDeclarativeSlotPool rejects offered slots if the job is currently restarting. The problem is that in case of a job restart, the scheduler sets the required resources to zero. Hence, all offered slots will be rejected.
This is a problem for local recovery because rejected slots will be freed by the TaskExecutor and thereby all local state will be deleted. Hence, in order to properly support local recovery, we need to handle this situation somehow. I do see different options here:
This problem only affects the DefaultScheduler since the AdaptiveScheduler sets the required resources when transitioning into the WaitingForResources state.
Accept excess slots
Accepting excess slots means that the DefaultDeclarativeSlotPool accepts slots which exceed the currently required set of slots.
Advantages:
- Easy to implement
Disadvantages:
- Offered slots that are not really needed will only be freed after the idle slot timeout. This means that some resources might be left unused for some time.
Let DefaultDeclarativeSlotPool accept excess slots only when job is restarting
Here the idea is to only accept excess slots when the job is currently restarting. This will required that the scheduler tells the DefaultDeclarativeSlotPool about the restarting state.
Advantages:
- We would only accept excess slots for the time of restarting
Disadvantages:
- We are complicating the semantics of the DefaultDeclarativeSlotPool. Moreover, we are introducing additional signals that communicate the restarting state to the pool.
Don't immediately free slots on the TaskExecutor when they are rejected
Instead of freeing the slot immediately on the TaskExecutor after it is rejected. We could also retry for some time and only free the slot after some timeout.
Advantages:
- No changes on the JobMaster side needed.
Disadvantages:
- Complication of the slot lifecycle on the TaskExecutor
- Unneeded slots are not made available for other jobs as fast as possible
Don't zero resource requirements during job restart
Instead of zeroing the resource requirements during a job restart, we could also keep the last know requirements. Once the job is restarted, we could adjust the requirements.
Advantages:
- Conceptually easy to do
Disadvantages:
- The old requirements mustn't necessarily be the new ones
- Convolutes logic in the scheduler
Attachments
Attachments
Issue Links
- causes
-
FLINK-26547 Accepting slots without a matching requirement leads to unfulfillable requirements
- Closed
- links to