Flink / FLINK-10429 Redesign Flink Scheduling, introducing dedicated Scheduler component / FLINK-14859

Avoid leaking unassigned Slot in DefaultScheduler when Deployment is outdated

Details

    Description

      In DefaultScheduler#assignResourceOrHandleError(), if the deployment is outdated, we should release the possibly acquired LogicalSlot so that we do not leak resources.
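
      The proposed behavior can be sketched as follows. This is a minimal, self-contained illustration and not the actual DefaultScheduler code: FakeSlot and FakeVersioner are hypothetical stand-ins for LogicalSlot and the vertex versioner, and the handler only models the "release if outdated" branch.

```java
import java.util.function.BiFunction;

public class OutdatedDeploymentSketch {

    /** Hypothetical stand-in for LogicalSlot; it only tracks whether it was released. */
    static final class FakeSlot {
        private boolean released;
        void release() { released = true; }
        boolean isReleased() { return released; }
    }

    /** Hypothetical stand-in for the version bookkeeping of a single vertex. */
    static final class FakeVersioner {
        private long currentVersion;
        long currentVersion() { return currentVersion; }
        void recordModification() { currentVersion++; }
        boolean isModified(long requiredVersion) { return requiredVersion != currentVersion; }
    }

    /** Builds the callback that runs once the slot request completes. */
    static BiFunction<FakeSlot, Throwable, Void> assignResourceOrHandleError(
            FakeVersioner versioner, long requiredVersion) {
        return (slot, failure) -> {
            if (versioner.isModified(requiredVersion)) {
                // Deployment is outdated: hand the slot back if one was acquired.
                if (slot != null) {
                    slot.release();
                }
                return null;
            }
            // Up-to-date deployment: a real scheduler would assign the slot here (omitted).
            return null;
        };
    }

    public static void main(String[] args) {
        FakeVersioner versioner = new FakeVersioner();
        BiFunction<FakeSlot, Throwable, Void> handler =
                assignResourceOrHandleError(versioner, versioner.currentVersion());

        versioner.recordModification(); // failover starts, so the deployment becomes outdated
        FakeSlot slot = new FakeSlot();
        handler.apply(slot, null);      // the slot arrives after the version bump
        System.out.println("released=" + slot.isReleased()); // prints released=true
    }
}
```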

      The following example illustrates how a slot leak is currently possible:

      1. Vertices A1, A2, A3 are scheduled in a batch.
      2. A2 acquires a slot. A1, A3 do not.
      3. A1 fails due to a slot allocation timeout and triggers failover (DefaultScheduler#cancelTasksAsync).
      4. A2 is canceled first, and its returned slot is assigned to A3, which triggers DefaultScheduler#assignResourceOrHandleError for A3.
        At this point A3 has not been canceled yet, but its deployment is already outdated because executionVertexVersioner#recordVertexModifications was invoked earlier, so the slot it just acquired is not released.
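
      The sequence above can be replayed with a small simulation. It is a sketch under simplifying assumptions: a single slot, plain CompletableFuture objects in place of the slot pool, and a version map in place of the vertex versioner. The Slot class, the replay method, and the releaseWhenOutdated flag are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

public class SlotLeakScenario {

    /** Hypothetical slot that only remembers its current owner (null means free). */
    static final class Slot {
        String owner;
    }

    /** Replays steps 1-4 and returns the slot's owner afterwards (null if the slot is free). */
    static String replay(boolean releaseWhenOutdated) {
        Map<String, Long> versions = new HashMap<>();
        Map<String, CompletableFuture<Slot>> requests = new HashMap<>();

        // Step 1: schedule A1, A2, A3 together; each deployment remembers the version it saw.
        for (String vertex : new String[] {"A1", "A2", "A3"}) {
            versions.put(vertex, 0L);
            final long requiredVersion = 0L;
            CompletableFuture<Slot> request = new CompletableFuture<>();
            requests.put(vertex, request);
            request.whenComplete((slot, failure) -> {
                if (failure != null) {
                    return; // the failover itself is driven explicitly in step 3
                }
                boolean outdated = versions.get(vertex).longValue() != requiredVersion;
                if (outdated && releaseWhenOutdated) {
                    slot.owner = null; // proposed behavior: give the slot back
                }
                // Otherwise the slot stays assigned, even if the deployment never happens.
            });
        }

        // Step 2: only A2 acquires a slot.
        Slot slot = new Slot();
        slot.owner = "A2";
        requests.get("A2").complete(slot);

        // Step 3: A1 times out; failover first bumps the versions of all vertices.
        requests.get("A1").completeExceptionally(new TimeoutException("slot allocation timeout"));
        versions.replaceAll((vertex, version) -> version + 1);

        // Step 4: A2 is canceled first; its slot is handed to A3's pending request.
        slot.owner = null;
        slot.owner = "A3";
        requests.get("A3").complete(slot);

        return slot.owner;
    }

    public static void main(String[] args) {
        System.out.println("without release: owner=" + replay(false)); // owner=A3, slot leaked
        System.out.println("with release: owner=" + replay(true));     // owner=null, slot free
    }
}
```

      Without the release, the slot stays assigned to A3 even though A3's deployment is skipped; with it, the slot is free again once the failover has run.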

          People

            gjy Gary Yao
              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 40m