FLINK-24005

Resource requirements declaration may be incorrect if the JobMaster disconnects from a TaskManager that has available slots in the SlotPool

    Description

      When a TaskManager disconnects from the JobMaster, `DeclarativeSlotPoolService#decreaseResourceRequirementsBy()` is triggered for all slots that the TaskManager has registered with the JobMaster. If some of those slots are still available, i.e. not assigned to any task, `decreaseResourceRequirementsBy` may lead to an incorrect resource requirements declaration.

      For example, consider a job with only 3 source tasks. It requires 3 slots and declares 3 slots. Initially all tasks are running. Then one task fails and waits for a restart delay before restarting. Its slot is returned to the SlotPool, so the job now requires 2 slots and declares 2 slots. At this moment, the TaskManager that owns the returned slot is lost. After the triggered `decreaseResourceRequirementsBy`, the job declares only 1 slot, because the requirement for that slot has now been decreased twice. Finally, when the failed task is rescheduled, the job declares 2 slots while it actually needs 3.
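
      The arithmetic in this example can be replayed with a small, self-contained sketch. It is only a toy model of the bookkeeping described above, not Flink code: the class `RequirementsDriftSketch`, the counter `DeclaredRequirements` and the method `replay()` are hypothetical names introduced for illustration, and the real `DeclarativeSlotPoolService` tracks resource profiles rather than a plain integer.

```java
// Hypothetical toy model of the scenario in this issue; none of these
// classes exist in Flink.
public class RequirementsDriftSketch {

    /** Stand-in for the number of slots the job currently declares. */
    static final class DeclaredRequirements {
        private int declared;

        void increaseBy(int n) {
            declared += n;
        }

        void decreaseBy(int n) {
            declared -= n;
        }

        int get() {
            return declared;
        }
    }

    /** Replays the steps of the example and returns {declared, needed}. */
    static int[] replay() {
        DeclaredRequirements requirements = new DeclaredRequirements();
        int needed = 3; // the job has 3 source tasks

        requirements.increaseBy(3); // job starts: declares 3 slots
        requirements.decreaseBy(1); // one task fails, its slot is returned: 2
        requirements.decreaseBy(1); // TaskManager of that free slot is lost: 1
        requirements.increaseBy(1); // failed task is rescheduled: 2

        return new int[] {requirements.get(), needed};
    }

    public static void main(String[] args) {
        int[] result = replay();
        System.out.println("declared=" + result[0] + ", needed=" + result[1]);
    }
}
```

      In this model the third step is the problematic one: the requirement for the free slot was already dropped when the task released it, so decreasing again on TaskManager loss leaves the declaration one slot short after the restart.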

      The attached log of a real job and the logs of the test added in https://github.com/zhuzhurk/flink/commit/59ca0ac5fa9c77b97c6e8a43dcc53ca8a0ad6c37 demonstrate this case.
      Note that the real job is configured with a large "restart-strategy.fixed-delay.delay" and a large "slot.idle.timeout", so this is possibly a rare case in production.

            People

              chesnay Chesnay Schepler
              zhuzh Zhu Zhu