Flink / FLINK-35897

Some checkpoint files and localState files can't be cleaned up when a checkpoint is aborted

Details

    Description

      Problem

      When a job checkpoint is canceled (AsyncSnapshotCallable.java#L129), the asynchronous snapshot thread may still continue executing and produce a completed checkpoint (RocksIncrementalSnapshotStrategy.java#L324). In this case, no component is responsible for cleaning up the completed checkpoint: neither the async snapshot thread nor SubtaskCheckpointCoordinatorImpl.

      How to reproduce it

      We can reproduce this issue by running the [DataGenWordCount example in my debug branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c], in which I've added some debug code. 

      How to fix it

      When the asynchronous snapshot thread completes a checkpoint, it needs to clean up the completed checkpoint if it finds that the checkpoint has been canceled.
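
      The proposed fix can be sketched as a small ownership handoff. This is a minimal illustration, not Flink's actual code: CancelAwareSnapshotSketch, Result, cancel(), and lastProduced() are hypothetical names standing in for the async snapshot callable and its snapshot result. The idea is that exactly one side wins a compare-and-set on a shared state, and if the snapshot thread loses to the cancel it discards what it just produced.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a cancel-aware snapshot task; not Flink's AsyncSnapshotCallable. */
final class CancelAwareSnapshotSketch implements Callable<CancelAwareSnapshotSketch.Result> {

    /** Hypothetical stand-in for a completed snapshot that owns files on disk. */
    static final class Result {
        private final AtomicBoolean discarded = new AtomicBoolean(false);

        /** In a real system this would delete checkpoint and localState files. */
        void discard() {
            discarded.set(true);
        }

        boolean isDiscarded() {
            return discarded.get();
        }
    }

    private enum State { RUNNING, COMPLETED, CANCELED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

    // Kept only so the demo can inspect a result that was never returned.
    private volatile Result lastProduced;

    /** Returns true if the cancel arrived before the snapshot was published. */
    boolean cancel() {
        return state.compareAndSet(State.RUNNING, State.CANCELED);
    }

    @Override
    public Result call() {
        // Stands in for the snapshot work, which may finish even after a cancel.
        Result result = new Result();
        lastProduced = result;

        // Publish the result only if nobody canceled in the meantime.
        if (!state.compareAndSet(State.RUNNING, State.COMPLETED)) {
            // The cancel won the race, so this thread owns the cleanup.
            result.discard();
            throw new CancellationException("checkpoint canceled; snapshot discarded");
        }
        return result;
    }

    Result lastProduced() {
        return lastProduced;
    }
}
```

      With this arrangement, a cancel that arrives first leaves cleanup to the snapshot thread, while a cancel that arrives after completion returns false from cancel(), so the caller knows the result was already handed over and must be cleaned up by whoever received it.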

            People

              lijinzhong Jinzhong Li