Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-9635

Local recovery scheduling can cause spread out of tasks

    XMLWordPrintableJSON

Details

    • Hide
      With the improvements to Flink's scheduling, it can no longer happen that recoveries require more slots than before if local recovery is enabled. Consequently, we encourage our users to use the local recovery feature which can be enabled by `state.backend.local-recovery: true`.
      Show
      With the improvements to Flink's scheduling, it can no longer happen that recoveries require more slots than before if local recovery is enabled. Consequently, we encourage our users to use the local recovery feature which can be enabled by `state.backend.local-recovery: true`.

    Description

      In order to make local recovery work, Flink's scheduling was changed such that it tries to be rescheduled to its previous location. In order to not occupy slots which have state of other tasks cached, the strategy will request a new slot if the old slot identified by the previous allocation id is no longer present. This also applies to newly allocated slots because there is no distinction between new or already used. This behaviour can cause that every tasks gets deployed to its own slot if the SlotPool has released all slots in the meantime, for example. The consequence could be that a job can no longer be executed after a failure because it needs more slots than before.

      Attachments

        Issue Links

          Activity

            People

              srichter Stefan Richter
              trohrmann Till Rohrmann
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: