FLINK-9351

RM stops assigning slots to the Job because the TM was killed before connecting to the JM successfully

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      The steps are as follows (copied from Stephan's comments in 5931):

      • JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
      • ResourceManager starts a container with a TaskManager
      • TaskManager registers at ResourceManager, which tells the TaskManager to push a slot to the JobManager.
      • TaskManager container is killed
      • The ResourceManager does not re-queue the slot requests (AllocationIDs) that it sent to the killed TaskManager, so the requests are lost and must time out before another attempt is made.
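
The sequence above can be modeled with a small, self-contained Java sketch. This is not Flink's implementation: `ToyResourceManager`, its fields, and the `requeueOnTaskManagerLoss` flag are hypothetical names invented here, and the model assumes one slot per TaskManager. It only illustrates how a request that was handed to a TaskManager disappears when that TaskManager is killed, and how putting it back in the pending queue would avoid waiting for the timeout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of the slot request flow. All names are hypothetical, not Flink's API. */
public class SlotRequeueSketch {

    /** Hypothetical stand-in for the ResourceManager's slot bookkeeping. */
    static class ToyResourceManager {
        // Slot requests (AllocationIDs) still waiting for a TaskManager.
        final Queue<String> pendingRequests = new ArrayDeque<>();
        // Requests already handed to a TaskManager, keyed by TaskManager id.
        final Map<String, List<String>> inFlightByTaskManager = new HashMap<>();
        // Assumption for illustration: toggles the behavior the issue asks for.
        final boolean requeueOnTaskManagerLoss;

        ToyResourceManager(boolean requeueOnTaskManagerLoss) {
            this.requeueOnTaskManagerLoss = requeueOnTaskManagerLoss;
        }

        /** Step 1: the JobMaster / SlotPool requests a slot. */
        void requestSlot(String allocationId) {
            pendingRequests.add(allocationId);
        }

        /** Step 3: a TaskManager registers and is given one pending request. */
        void registerTaskManager(String taskManagerId) {
            List<String> assigned =
                inFlightByTaskManager.computeIfAbsent(taskManagerId, id -> new ArrayList<>());
            String allocationId = pendingRequests.poll();
            if (allocationId != null) {
                assigned.add(allocationId);
            }
        }

        /** Step 4: the container is killed before the slot reaches the JobMaster. */
        void taskManagerKilled(String taskManagerId) {
            List<String> lost = inFlightByTaskManager.remove(taskManagerId);
            if (lost != null && requeueOnTaskManagerLoss) {
                // Put the requests back so they can go to the next TaskManager.
                pendingRequests.addAll(lost);
            }
            // Without re-queueing, the requests are simply forgotten here.
        }
    }

    /** Runs the scenario and returns how many requests are still pending afterwards. */
    static int pendingAfterKill(boolean requeue) {
        ToyResourceManager rm = new ToyResourceManager(requeue);
        rm.requestSlot("allocation-1");
        rm.registerTaskManager("tm-1");
        rm.taskManagerKilled("tm-1");
        return rm.pendingRequests.size();
    }

    public static void main(String[] args) {
        System.out.println("without re-queue: " + pendingAfterKill(false));
        System.out.println("with re-queue: " + pendingAfterKill(true));
    }
}
```

In this toy model the request vanishes when re-queueing is off, which corresponds to the JobMaster having to wait for its slot request to time out; with re-queueing on, the request is available again for the next TaskManager that registers.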

            People

              Assignee: Sihua Zhou (sihuazhou)
              Reporter: Sihua Zhou (sihuazhou)