  1. Flink
  2. FLINK-19832

Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator


    Details

      Description

      If the physical slot future for a new SharedSlot fails immediately in SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot, we still continue to add logical slots to this SharedSlot. The logical slot then fails as well and is removed from the SharedSlot, which is thereby released (state RELEASED). Subsequent attempts to add logical slots in the loop of allocateLogicalSlotsFromSharedSlots fail the scheduling on the ALLOCATED state check, because the SharedSlot is already RELEASED.

      The subsequent bulk timeout check also does not find the SharedSlot and fails with an NPE.
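
      The failure sequence described above can be reproduced with a small toy model. This is only a sketch: ToySharedSlot, the String-based slot values, the "group-1" key and the run() helper are hypothetical stand-ins and not Flink's actual classes. It relies on the fact that a callback registered on an already failed CompletableFuture runs synchronously in the calling thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

class ImmediateFailureDemo {
    enum State { ALLOCATED, RELEASED }

    /** Hypothetical stand-in for a shared slot; releases itself once its last logical slot is removed. */
    static class ToySharedSlot {
        State state = State.ALLOCATED;
        final CompletableFuture<String> physicalSlot;
        final Set<String> logicalSlots = new HashSet<>();
        final Runnable onRelease;

        ToySharedSlot(CompletableFuture<String> physicalSlot, Runnable onRelease) {
            this.physicalSlot = physicalSlot;
            this.onRelease = onRelease;
        }

        CompletableFuture<String> allocateLogicalSlot(String executionId) {
            // Models the ALLOCATED state check mentioned in the description.
            if (state != State.ALLOCATED) {
                throw new IllegalStateException("SharedSlot is " + state);
            }
            logicalSlots.add(executionId);
            CompletableFuture<String> logical =
                    physicalSlot.thenApply(p -> p + "/" + executionId);
            // If physicalSlot has already failed, this callback runs right here, synchronously.
            logical.whenComplete((slot, error) -> {
                if (error != null) {
                    removeLogicalSlot(executionId);
                }
            });
            return logical;
        }

        void removeLogicalSlot(String executionId) {
            logicalSlots.remove(executionId);
            if (logicalSlots.isEmpty()) {
                state = State.RELEASED;
                onRelease.run();
            }
        }
    }

    /** Walks through the problematic sequence and records what happens at each step. */
    static List<String> run() {
        List<String> events = new ArrayList<>();
        Map<String, ToySharedSlot> sharedSlots = new HashMap<>();

        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("no resources"));

        ToySharedSlot slot = new ToySharedSlot(failed, () -> sharedSlots.remove("group-1"));
        sharedSlots.put("group-1", slot);

        slot.allocateLogicalSlot("execution-1");
        events.add("state after first logical slot: " + slot.state);

        try {
            slot.allocateLogicalSlot("execution-2");
        } catch (IllegalStateException e) {
            events.add("second logical slot: IllegalStateException");
        }

        try {
            // Models a later check that expects the shared slot to still be registered.
            sharedSlots.get("group-1").physicalSlot.isDone();
        } catch (NullPointerException e) {
            events.add("bulk check: NullPointerException");
        }
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

      In this model the shared slot is released and unregistered during the very first allocateLogicalSlot call, which is why both later steps fail.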

      Hence, a SharedSlot whose physical slot future failed immediately should not be kept in the SlotSharingExecutionSlotAllocator, and the logical slot requests that depend on it can be returned as failed right away. The bulk timeout check does not need to be started, because if any physical slot request (and with it its logical slot requests) fails, the scheduler cancels the whole bulk.
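
      A minimal sketch of the proposed handling, under simplifying assumptions: slots are modelled as Strings, the registry is a plain Map, and the names allocate, Result and bulkTimeoutCheckNeeded are hypothetical and do not come from the Flink code base. It only illustrates the idea of not registering a shared slot whose physical slot future has already failed, returning the dependent logical slot requests as failed, and skipping the bulk timeout check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class FailFastAllocationSketch {
    /** Hypothetical result holder: logical slot futures plus a flag for the bulk timeout check. */
    static final class Result {
        final Map<String, CompletableFuture<String>> logicalSlots = new LinkedHashMap<>();
        boolean bulkTimeoutCheckNeeded;
    }

    static Result allocate(
            Map<String, CompletableFuture<String>> sharedSlots,
            String group,
            List<String> executionIds,
            CompletableFuture<String> physicalSlot) {
        Result result = new Result();
        boolean failedImmediately = physicalSlot.isCompletedExceptionally();

        // Only keep the shared slot if its physical slot request has not already failed.
        if (!failedImmediately) {
            sharedSlots.put(group, physicalSlot);
        }

        // Logical slots derive from the physical slot future, so for an already
        // failed physical slot they are returned as already failed futures.
        for (String executionId : executionIds) {
            result.logicalSlots.put(
                    executionId, physicalSlot.thenApply(p -> p + "/" + executionId));
        }

        // Assumption from the description: a failed request makes the scheduler
        // cancel the whole bulk, so no timeout check is started in that case.
        result.bulkTimeoutCheckNeeded = !failedImmediately;
        return result;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> sharedSlots = new HashMap<>();
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("no resources"));

        Result result = allocate(sharedSlots, "group-1", Arrays.asList("e1", "e2"), failed);
        System.out.println("registered shared slots: " + sharedSlots.size());
        System.out.println("bulk timeout check needed: " + result.bulkTimeoutCheckNeeded);
    }
}
```

      The sketch checks for failure once, at allocation time. A physical slot future that fails later is a different case and is not covered here.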

      If this last assumption does not hold for future scheduling, the bulk failure might need additional explicit cancellation of pending requests. We expect to refactor this for declarative scheduling anyway.

            People

            • Assignee: Andrey Zagrebin (azagrebin)
              Reporter: Andrey Zagrebin (azagrebin)
