Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-12865

State inconsistency between RM and TM on the slot status

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      There may be state inconsistency between TM and RM due to race condition and message loss:

      1. When TM sends heartbeat, it retrieve SlotReport in the main thread, but sends the heartbeat in another thread. There may be cases that the slot on TM is FREE initially and SlotReport read the FREE state, then RM requests slot and mark the slot as allocated, and the SlotReport finally override the allocated status at the RM side wrongly.
      2. When RM requests slot, TM received the requests but the acknowledge message get lot. Then RM will think this slot is free. 

       Both the problems may cause RM marks an ALLOCATED slot as FREE. This may currently cause additional retries till the state is synchronized after the next heartbeat, and for the inaccurate resource statistics for the fine-grained resource management in the future.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            trohrmann Till Rohrmann
            gaoyunhaii Yun Gao
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Issue deployment