Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-13184

Starting a TaskExecutor blocks the YarnResourceManager's main thread

    XMLWordPrintableJSON

    Details

      Description

      Currently, YarnResourceManager starts all task executors in main thread. This could cause RM to become unresponsive when launching a large number of TEs (e.g. > 1000) because it involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous NMClient). As a consequence, TE registration/heartbeat timeouts can occur and Flink might allocate too many excessive containers (see FLINK-12342) because it cannot process the YarnResourceManager#onContainersAllocated calls.

      There are different solution approaches but the end goal should be to not execute any blocking calls in the ResourceManager's main thread:

      1. Start the TaskExecutors from a different thread (potentially thread pool) which is responsible for uploading the files and communicating with the NodeManager
      2. Don't upload files (avoid blocking file system operations) and use the NMClientAsync for the communication with Yarn's NodeManager.
      3. Upload files in a separate I/O thread and use the NMClientAsync for the communication with Yarn's NodeManager.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                fly_in_gis Yang Wang
                Reporter:
                xintongsong Xintong Song
              • Votes:
                2 Vote for this issue
                Watchers:
                13 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 20m
                  1h 20m