Details
- Type: Sub-task
- Status: Closed
- Priority: Critical
- Resolution: Done
- Affects Version/s: 1.9.0
Description
Recently, we encountered a case in which a TaskExecutor got stuck while launching on Yarn (without failing), so that the job could not recover from continuous failovers.
The TaskExecutor got stuck because of a problem in our environment: it hung somewhere after the ResourceManager had started it, while the ResourceManager was waiting for it to come up and register. Later, when the slot request timed out, the job failed over and requested slots from the ResourceManager again, but the ResourceManager still saw a TaskExecutor (the stuck one) as being started and did not request a new container from Yarn. As a result, the job could not recover from the failure.
To avoid such an unrecoverable state, I think the ResourceManager needs a timeout on starting a new TaskExecutor. If starting a TaskExecutor takes too long, the ResourceManager should fail that TaskExecutor and start a new one, as in the sketch below.
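Below is a minimal sketch of the proposed mechanism, not Flink's actual implementation: a watchdog in the ResourceManager that schedules a timeout when a new worker container is requested and, if the TaskExecutor has not registered by then, releases the pending worker and requests a replacement. The class and method names (WorkerStartupWatchdog, onWorkerRequested, onWorkerRegistered) and the recovery callback are hypothetical placeholders.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WorkerStartupWatchdog {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Workers that have been requested from Yarn but have not registered yet,
    // mapped to the action that releases them and requests a replacement.
    private final Map<String, Runnable> pendingWorkers = new ConcurrentHashMap<>();

    private final Duration startupTimeout;

    public WorkerStartupWatchdog(Duration startupTimeout) {
        this.startupTimeout = startupTimeout;
    }

    /** Called right after the ResourceManager requests a new container. */
    public void onWorkerRequested(String workerId, Runnable releaseAndReRequest) {
        pendingWorkers.put(workerId, releaseAndReRequest);
        scheduler.schedule(() -> {
            // If the worker is still pending when the timeout fires, it never
            // registered; release it and request a replacement container
            // instead of waiting forever.
            Runnable recovery = pendingWorkers.remove(workerId);
            if (recovery != null) {
                recovery.run();
            }
        }, startupTimeout.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Called when the TaskExecutor registers; cancels the pending timeout. */
    public void onWorkerRegistered(String workerId) {
        pendingWorkers.remove(workerId);
    }
}
```

With this approach, the stuck container described above would be released once the timeout fires, so the ResourceManager would request a fresh container and the job could eventually recover.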
Issue Links
- fixes
  - FLINK-16215 Start redundant TaskExecutor when JM failed (Open)
- is blocked by
  - FLINK-18620 Unify behaviors of active resource managers (Closed)
- is related to
  - FLINK-18229 Pending worker requests should be properly cleared (Closed)
  - FLINK-15456 Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests (Closed)
  - FLINK-19171 K8s Resource Manager may lead to resource leak after pod deleted (Closed)