Flink / FLINK-12863

Race condition between slot offerings and AllocatedSlotReport

      Description

      With FLINK-11059 we introduced the AllocatedSlotReport, which the TaskExecutor uses to synchronize its internal view of slot allocations with the JobMaster's view. There appears to be a race condition between offering slots and receiving the report, because the AllocatedSlotReport is sent by the HeartbeatManagerSenderImpl from a separate thread.

      As a result, an AllocatedSlotReport can be generated just before new slots are offered. Since the report is sent from a different thread, the response to the slot offering can be sent before the AllocatedSlotReport. Consequently, the TaskExecutor might receive an outdated slot report, causing active slots to be released.

      To solve the problem, I propose adding a fencing token to the AllocatedSlotReport that is updated whenever we offer new slots to the JobMaster. When the TaskExecutor receives the AllocatedSlotReport, it compares its current slot report fencing token with the received one and processes the report only if they are equal. Otherwise, it waits for the next heartbeat to deliver an up-to-date AllocatedSlotReport.
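
      The proposal could be sketched roughly as follows. This is a minimal, single-threaded illustration and not Flink's actual code: the classes SlotReportSketch, JobMasterSketch and TaskExecutorSketch and all of their methods are hypothetical, slots are plain strings, and it assumes the token travels with each slot offer so that the JobMaster can echo the latest one it has seen in its report.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for AllocatedSlotReport, extended with a fencing token. */
final class SlotReportSketch {
    final long fencingToken;
    final Set<String> allocatedSlotIds;

    SlotReportSketch(long fencingToken, Set<String> allocatedSlotIds) {
        this.fencingToken = fencingToken;
        this.allocatedSlotIds = allocatedSlotIds;
    }
}

/** Hypothetical JobMaster side: remembers the token of the latest offer it accepted. */
final class JobMasterSketch {
    private long lastSeenToken = 0L;
    private final Set<String> allocated = new HashSet<>();

    void acceptOffer(long token, Set<String> slotIds) {
        lastSeenToken = token;
        allocated.addAll(slotIds);
    }

    /** Snapshot of the current view; it may be delivered late by the heartbeat thread. */
    SlotReportSketch createReport() {
        return new SlotReportSketch(lastSeenToken, new HashSet<>(allocated));
    }
}

/** Hypothetical TaskExecutor side: bumps the token on every slot offer. */
final class TaskExecutorSketch {
    private long currentToken = 0L;
    private final Set<String> activeSlots = new HashSet<>();

    /** Marks the slots active and returns the new token to attach to the offer. */
    long offerSlots(Set<String> slotIds) {
        activeSlots.addAll(slotIds);
        return ++currentToken;
    }

    /** Returns true if the report was processed, false if it was ignored as outdated. */
    boolean onSlotReport(SlotReportSketch report) {
        if (report.fencingToken != currentToken) {
            return false; // outdated: wait for the next heartbeat
        }
        // Release every slot the JobMaster no longer considers allocated.
        activeSlots.retainAll(report.allocatedSlotIds);
        return true;
    }

    Set<String> activeSlots() {
        return new HashSet<>(activeSlots);
    }
}
```

      In this sketch, a report created before a second offer carries the older token, so the TaskExecutor ignores it and keeps the newly offered slot active until a report with the matching token arrives.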

              People

              • Assignee: Till Rohrmann (till.rohrmann)
              • Reporter: Till Rohrmann (till.rohrmann)

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m