
    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.15.0
    • Labels:
      None

      Description

      FLINK-23139 adds private state management capabilities to TM.

      However, it does not consider multiple retained checkpoints. 

      In most cases, it should work correctly:

      1. TMs will not discard the state of previous checkpoints if it is not used in the latest one, because they are not aware of it
      2. If some state is reused (incremental checkpoints), TMs will discard it when the latest checkpoint is subsumed, which means that the previous checkpoints have been subsumed too

      However, JM will also try to discard the state on subsumption (it's not shared between TMs).

      So the state will be removed, and not prematurely, but the JM and the TMs may race to discard the same state.

      The simplest way to solve this is to ignore (log) discard errors.
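
      A minimal sketch of what "ignore (log) discard errors" could look like. `DiscardableState` and `discardAllQuietly` are hypothetical names used for illustration only; they are not Flink's actual interfaces. The assumption is that a discard fails with an exception when the other side (JM or TM) has already removed the state.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

final class TolerantDiscard {
    private static final Logger LOG = Logger.getLogger(TolerantDiscard.class.getName());

    /** Hypothetical stand-in for a state handle; not Flink's actual interface. */
    public interface DiscardableState {
        void discard() throws Exception;
    }

    /**
     * Tries to discard every handle. A failed discard is logged and counted,
     * and does not prevent the remaining handles from being discarded.
     *
     * @return the number of handles whose discard threw an exception
     */
    static int discardAllQuietly(List<? extends DiscardableState> handles) {
        int failures = 0;
        for (DiscardableState handle : handles) {
            try {
                handle.discard();
            } catch (Exception e) {
                failures++;
                // Assumption: the state may already have been removed by the other owner.
                LOG.log(Level.WARNING, "Could not discard state; ignoring", e);
            }
        }
        return failures;
    }
}
```

      With this approach a lost race only produces a log entry, at the cost of also hiding discard failures that have other causes.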

      Other options include:

      1. treat all state after recovery as "distributed", so TMs won't discard it
      2. compute the intersection between the checkpoints and pass it to TMs as distributed state, so they won't discard it
      3. compute the intersection between the checkpoints and prevent JM from discarding it
      4. pass all recovered snapshots from JM to TM on recovery
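
      Options 2 and 3 both rely on computing the intersection between the retained checkpoints. A rough sketch of that step, assuming each checkpoint can be modeled as a set of state handle IDs (plain strings here) and that "intersection" means state referenced by every retained checkpoint. Both the modeling and the name `SharedStateIntersection` are assumptions, not part of Flink.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SharedStateIntersection {

    /**
     * Returns the state IDs that appear in every given checkpoint.
     * Each checkpoint is modeled as a set of state handle IDs (an assumption).
     */
    static Set<String> intersect(List<Set<String>> checkpoints) {
        if (checkpoints.isEmpty()) {
            return Collections.emptySet();
        }
        // Start from a mutable copy of the first checkpoint and narrow it down.
        Set<String> shared = new HashSet<>(checkpoints.get(0));
        for (Set<String> checkpoint : checkpoints.subList(1, checkpoints.size())) {
            shared.retainAll(checkpoint);
        }
        return shared;
    }
}
```

      The resulting set could then either be handed to TMs as distributed state (option 2) or excluded from JM-side discard (option 3).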

            People

            • Assignee:
              Unassigned
              Reporter:
              Roman Khachatryan
