Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
1.8.1, 1.9.0, 1.10.0
Description
Currently, YarnResourceManager starts all task executors in main thread. This could cause RM to become unresponsive when launching a large number of TEs (e.g. > 1000) because it involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous NMClient). As a consequence, TE registration/heartbeat timeouts can occur and Flink might allocate too many excessive containers (see FLINK-12342) because it cannot process the YarnResourceManager#onContainersAllocated calls.
There are different solution approaches but the end goal should be to not execute any blocking calls in the ResourceManager's main thread:
1. Start the TaskExecutors from a different thread (potentially thread pool) which is responsible for uploading the files and communicating with the NodeManager
2. Don't upload files (avoid blocking file system operations) and use the NMClientAsync for the communication with Yarn's NodeManager.
3. Upload files in a separate I/O thread and use the NMClientAsync for the communication with Yarn's NodeManager.
Attachments
Issue Links
- causes
-
FLINK-12342 Yarn Resource Manager Acquires Too Many Containers
- Resolved
- is duplicated by
-
FLINK-13590 flink-on-yarn sometimes could create many little files that are xxx-taskmanager-conf.yaml
- Closed
-
FLINK-9410 Replace NMClient with NMClientAsync in YarnResourceManager
- Closed
-
FLINK-14582 Do not upload {uuid}-taskmanager-conf.yaml for each task manager container
- Closed
- is related to
-
FLINK-15053 Configurations with values contains space may cause TM failures on Yarn
- Resolved
- links to