[FLINK-13184] Starting a TaskExecutor blocks the YarnResourceManager's main thread - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.8.1, 1.9.0, 1.10.0
Fix Version/s: 1.8.3, 1.9.2, 1.10.0
Component/s: Deployment / YARN
Labels:
- pull-request-available

Description

Currently, YarnResourceManager starts all task executors in main thread. This could cause RM to become unresponsive when launching a large number of TEs (e.g. > 1000) because it involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous NMClient). As a consequence, TE registration/heartbeat timeouts can occur and Flink might allocate too many excessive containers (see ~~FLINK-12342~~) because it cannot process the YarnResourceManager#onContainersAllocated calls.

There are different solution approaches but the end goal should be to not execute any blocking calls in the ResourceManager's main thread:

1. Start the TaskExecutors from a different thread (potentially thread pool) which is responsible for uploading the files and communicating with the NodeManager
2. Don't upload files (avoid blocking file system operations) and use the NMClientAsync for the communication with Yarn's NodeManager.
3. Upload files in a separate I/O thread and use the NMClientAsync for the communication with Yarn's NodeManager.

Attachments

Issue Links

causes

FLINK-12342 Yarn Resource Manager Acquires Too Many Containers

Resolved

is duplicated by

FLINK-13590 flink-on-yarn sometimes could create many little files that are xxx-taskmanager-conf.yaml

Closed

FLINK-9410 Replace NMClient with NMClientAsync in YarnResourceManager

Closed

FLINK-14582 Do not upload {uuid}-taskmanager-conf.yaml for each task manager container

Closed

is related to

FLINK-15053 Configurations with values contains space may cause TM failures on Yarn

Resolved

links to

GitHub Pull Request #10143

GitHub Pull Request #10259

GitHub Pull Request #10260

(3 links to)

Activity

People

Assignee:: Yang Wang

Reporter:: Xintong Song

Votes:: 2 Vote for this issue

Watchers:: 13 Start watching this issue

Dates

Created:: 10/Jul/19 03:30

Updated:: 28/Feb/20 10:46

Resolved:: 20/Nov/19 09:03

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 20m