Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-13184

Starting a TaskExecutor blocks the YarnResourceManager's main thread

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

      Description

      Currently, YarnResourceManager starts all task executors in main thread. This could cause RM to become unresponsive when launching a large number of TEs (e.g. > 1000) because it involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous NMClient). As a consequence, TE registration/heartbeat timeouts can occur and Flink might allocate too many excessive containers (see FLINK-12342) because it cannot process the YarnResourceManager#onContainersAllocated calls.

      There are different solution approaches but the end goal should be to not execute any blocking calls in the ResourceManager's main thread:

      1. Start the TaskExecutors from a different thread (potentially thread pool) which is responsible for uploading the files and communicating with the NodeManager
      2. Don't upload files (avoid blocking file system operations) and use the NMClientAsync for the communication with Yarn's NodeManager.
      3. Upload files in a separate I/O thread and use the NMClientAsync for the communication with Yarn's NodeManager.

        Attachments

        Issue Links

          Activity

            People

            • Assignee:
              fly_in_gis Yang Wang
              Reporter:
              xintongsong Xintong Song

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 1h 20m
                1h 20m

                  Issue deployment