Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-9635

Local recovery scheduling can cause spread out of tasks

    Details

    • Release Note:
      Hide
      With the improvements to Flink's scheduling, it can no longer happen that recoveries require more slots than before if local recovery is enabled. Consequently, we encourage our users to use the local recovery feature which can be enabled by `state.backend.local-recovery: true`.
      Show
      With the improvements to Flink's scheduling, it can no longer happen that recoveries require more slots than before if local recovery is enabled. Consequently, we encourage our users to use the local recovery feature which can be enabled by `state.backend.local-recovery: true`.

      Description

      In order to make local recovery work, Flink's scheduling was changed such that it tries to be rescheduled to its previous location. In order to not occupy slots which have state of other tasks cached, the strategy will request a new slot if the old slot identified by the previous allocation id is no longer present. This also applies to newly allocated slots because there is no distinction between new or already used. This behaviour can cause that every tasks gets deployed to its own slot if the SlotPool has released all slots in the meantime, for example. The consequence could be that a job can no longer be executed after a failure because it needs more slots than before.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                srichter Stefan Richter
                Reporter:
                till.rohrmann Till Rohrmann
              • Votes:
                0 Vote for this issue
                Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: