Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
- 1.10.0
- None
Description
As shown in the attached JM log, the job tried to start 30 TMs, but only 29 registered. The job therefore fails because it cannot acquire all 30 required slots in time.
When failover happens and the tasks are rescheduled, the RM does not request new TMs even though it cannot fulfill the slot requests. As a result, the job keeps failing with slot allocation timeouts.
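To illustrate the kind of bookkeeping that would avoid this, here is a minimal sketch of a start-up timeout for requested workers. It is not Flink's ResourceManager code: the class PendingWorkerTracker, its methods, and the "tm-N" worker ids are all hypothetical, and time is passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model for illustration only; not Flink's ResourceManager API. */
public class PendingWorkerTracker {
    private final long startupTimeoutMillis;
    // Worker id -> time (in millis) at which the start request was issued.
    private final Map<String, Long> pending = new LinkedHashMap<>();

    public PendingWorkerTracker(long startupTimeoutMillis) {
        this.startupTimeoutMillis = startupTimeoutMillis;
    }

    /** Records that a worker was requested at the given time. */
    public void requestWorker(String workerId, long nowMillis) {
        pending.put(workerId, nowMillis);
    }

    /** Marks a worker as registered, so it is no longer pending. */
    public void workerRegistered(String workerId) {
        pending.remove(workerId);
    }

    /**
     * Removes and returns the workers that have not registered within the
     * timeout. The caller would re-request one replacement per returned id.
     */
    public List<String> expire(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= startupTimeoutMillis) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }

    /** Number of requested workers that have not yet registered. */
    public int pendingCount() {
        return pending.size();
    }
}
```

In the scenario above, 30 workers would be requested and 29 registered, leaving one pending entry. Once the timeout elapses, expire returns that entry, which gives the caller a signal to request a replacement instead of waiting indefinitely.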
Attachments
Issue Links
- relates to FLINK-13554 ResourceManager should have a timeout on starting new TaskExecutors. (Closed)