Details
- Sub-task
- Status: Closed
- Major
- Resolution: Fixed
- 1.10.0
Description
In DefaultScheduler#assignResourceOrHandleError(), if the deployment is outdated, we should release the possibly acquired LogicalSlot so that we do not leak resources.
The following example illustrates how a slot leak is currently possible:
- Vertices A1, A2, A3 are scheduled in a batch.
- A2 acquires a slot. A1, A3 do not.
- A1 fails due to a slot allocation timeout and triggers failover (DefaultScheduler#cancelTasksAsync).
- A2 is canceled first and its returned slot is assigned to A3, which triggers DefaultScheduler#assignResourceOrHandleError of A3.
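However, A3 has not been canceled yet, but its deployment is already outdated because executionVertexVersioner#recordVertexModifications was invoked earlier. Unless the slot just handed to A3 is released at this point, it is leaked.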
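
The proposed behavior can be sketched in a small, self-contained Java example. This is not the Flink implementation: Slot, Versioner, the String vertex ids, and the Boolean return value are hypothetical stand-ins chosen for illustration. Only the name assignResourceOrHandleError and the idea of a recorded vertex version come from the description above. The sketch shows the callback releasing a slot that arrives after the vertex version has changed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

public class SlotLeakSketch {

    /** Hypothetical stand-in for a LogicalSlot; it only tracks whether it was released. */
    static final class Slot {
        private boolean released;

        void release() {
            released = true;
        }

        boolean isReleased() {
            return released;
        }
    }

    /** Hypothetical stand-in for the vertex versioner: one counter per vertex id. */
    static final class Versioner {
        private final Map<String, Long> versions = new HashMap<>();

        long currentVersion(String vertexId) {
            return versions.getOrDefault(vertexId, 0L);
        }

        void recordModification(String vertexId) {
            versions.merge(vertexId, 1L, Long::sum);
        }

        boolean isModified(String vertexId, long requiredVersion) {
            return currentVersion(vertexId) != requiredVersion;
        }
    }

    /**
     * Builds the callback that runs once slot allocation completes.
     * Returns true if the slot was assigned, false otherwise.
     */
    static BiFunction<Slot, Throwable, Boolean> assignResourceOrHandleError(
            Versioner versioner, String vertexId, long requiredVersion) {
        return (slot, failure) -> {
            if (versioner.isModified(vertexId, requiredVersion)) {
                // The deployment is outdated: give back the slot instead of dropping it.
                if (slot != null) {
                    slot.release();
                }
                return false;
            }
            // Assumption: error handling is reduced to "not assigned" in this sketch.
            return failure == null && slot != null;
        };
    }

    public static void main(String[] args) {
        Versioner versioner = new Versioner();
        long versionAtScheduling = versioner.currentVersion("A3");
        BiFunction<Slot, Throwable, Boolean> callback =
                assignResourceOrHandleError(versioner, "A3", versionAtScheduling);

        // Failover records a modification before A3 itself is canceled.
        versioner.recordModification("A3");

        // The slot returned by A2 is now offered to A3.
        Slot slotFromA2 = new Slot();
        boolean assigned = callback.apply(slotFromA2, null);

        // prints "assigned=false, released=true"
        System.out.println("assigned=" + assigned + ", released=" + slotFromA2.isReleased());
    }
}
```

Without the release call in the outdated branch, slotFromA2 would stay unreleased after the callback returns, which corresponds to the leak described in the scenario.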