Flink / FLINK-20988 Various resource allocation improvements for active resource managers / FLINK-13554

ResourceManager should have a timeout on starting new TaskExecutors.

Details

    Description

      Recently, we encountered a case in which one TaskExecutor got stuck while launching on Yarn (without failing), so the job could not recover from continuous failovers.

      The TaskExecutor got stuck because of a problem in our environment. It hung at some point after the ResourceManager had started it, while the ResourceManager was waiting for it to come up and register. Later, when the slot request timed out, the job failed over and requested slots from the ResourceManager again. The ResourceManager still saw a TaskExecutor (the stuck one) as being started, so it did not request a new container from Yarn. As a result, the job could not recover from the failure.

      To avoid such an unrecoverable state, I think the ResourceManager needs a timeout on starting a new TaskExecutor. If starting a TaskExecutor takes too long, the ResourceManager should fail that TaskExecutor and start a new one.
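
      The proposed behaviour could be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink's actual implementation: the class PendingWorkerTracker, its method names, and the use of plain string worker IDs are all hypothetical. The idea is to record when each TaskExecutor start was requested, forget the entry once the TaskExecutor registers, and periodically collect the entries that have exceeded the timeout so the caller can release them and request replacements.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Flink class): tracks workers whose start has been
 * requested but which have not yet registered with the ResourceManager.
 */
final class PendingWorkerTracker {
    private final long timeoutMillis;
    // workerId -> timestamp (ms) at which the start was requested, in request order
    private final Map<String, Long> pending = new LinkedHashMap<>();

    PendingWorkerTracker(Duration registrationTimeout) {
        this.timeoutMillis = registrationTimeout.toMillis();
    }

    /** Called when a container / TaskExecutor start has been requested. */
    void onWorkerRequested(String workerId, long nowMillis) {
        pending.put(workerId, nowMillis);
    }

    /** Called when the TaskExecutor has come up and registered. */
    void onWorkerRegistered(String workerId) {
        pending.remove(workerId);
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Removes and returns every worker that has been starting for at least the
     * timeout. The caller is expected to release these workers and request new ones.
     */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

      Because an expired worker is dropped from the pending set, it no longer counts as "being started", so a later slot request would lead to a new container request instead of waiting on the stuck one. Passing the current time in as a parameter is a choice made here only to keep the sketch deterministic.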

            People

              xtsong Xintong Song