Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
1.9.3, 1.10.1, 1.11.0
Description
With FLINK-9932 we loosened the slot allocation protocol in a way that the JobMaster can deploy Tasks into a slot which has not been ACTIVATED but only ALLOCATED for a given job. This allowed to better handle the case where the JobMasterGateway#offerSlots response was late so that it timed out. The way it was solved is to offer a TaskSlotTable#tryMarkSlotActive method which, in contrast to TaskSlotTable#markSlotActive, would not fail if the requested slot was not available.
However, the problem is that the former method does not deactivate the slot timeout. Hence, it can happen if the offerSlots response never arrives at the TaskExecutor that an ACTIVATED slot times out.
In order to fix the problem, we should also deactivate the slot timeout when TaskSlotTable#tryMarkSlotActive is being called.
Attachments
Issue Links
- is related to
-
FLINK-17404 Running HA per-job cluster (rocks, non-incremental) gets stuck killing a non-existing pid in Hadoop 3 build profile
-
- Closed
-
- links to