  1. Flink
  2. FLINK-15456

Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.10.0
    • Fix Version/s: 1.10.0
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      According to the attached JM log, the job tried to start 30 TMs, but only 29 registered. The job therefore fails because it cannot acquire all 30 required slots before the slot request times out.
      When failover happens and the tasks are rescheduled, the RM does not request new TMs even though it cannot fulfill the pending slot requests, so the job keeps failing with slot allocation timeouts.
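
      The reported symptom can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink's ResourceManager code: the class `ToyResourceManager` and its fields are hypothetical, and the idea that the missing TM stays counted as "pending" is one possible explanation that this report does not confirm.

```python
from dataclasses import dataclass


@dataclass
class ToyResourceManager:
    """Hypothetical bookkeeping model; NOT Flink's actual ResourceManager."""

    registered_slots: int = 0  # slots from TMs that actually registered
    pending_slots: int = 0     # slots expected from TMs requested but not yet registered

    def request_slots(self, needed: int) -> int:
        """Request one new TM per slot not covered by registered or pending slots.

        Returns the number of new TMs requested.
        """
        shortfall = max(0, needed - self.registered_slots - self.pending_slots)
        self.pending_slots += shortfall
        return shortfall

    def register_tm(self) -> None:
        """A requested TM registers and offers one slot."""
        if self.pending_slots > 0:
            self.pending_slots -= 1
        self.registered_slots += 1

    def can_fulfill(self, needed: int) -> bool:
        return self.registered_slots >= needed


rm = ToyResourceManager()
first_request = rm.request_slots(30)  # initial scheduling asks for 30 TMs

# Only 29 of the 30 requested TMs ever register (as in the report).
for _ in range(29):
    rm.register_tm()

# Assumption: the lost TM is never removed from the pending count, so each
# failover sees the demand as "already covered" and requests nothing new.
attempts = []
for _ in range(3):
    newly_requested = rm.request_slots(30)
    attempts.append((newly_requested, rm.can_fulfill(30)))
```

      In this model every rescheduling attempt requests zero new TMs while the job still cannot obtain its 30 slots, which matches the repeated slot allocation timeouts described above. The actual root cause is tracked in the issue that this one duplicates.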

      Attachments

        1. jm.log
          1.94 MB
          Zhu Zhu
        2. tm_container_07.log
          224 kB
          Zhu Zhu
        3. jm_part2.log
          2.27 MB
          Zhu Zhu
        4. jm_part.log
          1.94 MB
          Zhu Zhu

            People

              Assignee: Unassigned
              Reporter: Zhu Zhu (zhuzh)
              Votes: 0
              Watchers: 6
