Flink / FLINK-28984

FsCheckpointStateOutputStream is not being released normally

    Description

      If a checkpoint is aborted, AsyncSnapshotCallable closes its snapshotCloseableRegistry when it is cancelled. Two situations can then occur:

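      1. The FSDataOutputStream has already been created; it is closed (and its file cleaned up) when FsCheckpointStateOutputStream is closed.
      2. The FSDataOutputStream has not been created yet, but the closed flag has already been set to true, so the stream that is created afterwards is never closed by FsCheckpointStateOutputStream. This shows up in the log: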
        2022-08-16 12:55:44,161 WARN  org.apache.flink.core.fs.SafetyNetCloseableRegistry           - Closing unclosed resource via safety-net: ClosingFSDataOutputStream(org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream@4ebe8e64) : xxxxx/flink/checkpoint/state/9214a2e302904b14baf2dc1aacbc7933/ae157c5a05a8922a46a179cdb4c86b10/shared/9d8a1e92-2f69-4ab0-8ce9-c1beb149229a 

              The SafetyNetCloseableRegistry eventually closes the leaked output stream, but the file is never deleted.

      The second case typically occurs when the storage system is slow to create files, so that file creation is still in progress when the checkpoint is aborted.
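
      The race in the second case can be modelled outside Flink. The sketch below is a hypothetical, simplified stand-in, not Flink code: the class LazyStateStream, its guarded flag, the two latches and safetyNetClose() are all illustrative names. The unguarded variant publishes the stream after close() has already run, so nothing deletes the file (matching the safety-net log above). The guarded variant re-checks the closed flag once creation finishes and cleans up itself; this is one possible mitigation, not necessarily the upstream fix.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

public class LazyStreamRaceDemo {

    /** Hypothetical stand-in for a checkpoint stream whose file is created lazily. */
    static final class LazyStateStream implements AutoCloseable {
        private final Path statePath;
        private final boolean guarded; // true: re-check `closed` after the file has been created
        private volatile boolean closed;
        private OutputStream out;      // null until the (slow) creation finishes

        LazyStateStream(Path statePath, boolean guarded) {
            this.statePath = statePath;
            this.guarded = guarded;
        }

        /** Creates the state file; the latches let the demo simulate a slow file system. */
        void createStream(CountDownLatch started, CountDownLatch proceed)
                throws IOException, InterruptedException {
            if (closed) {
                throw new IOException("stream already closed");
            }
            started.countDown();
            proceed.await(); // simulates high latency while the storage system creates the file
            OutputStream created = Files.newOutputStream(statePath);
            synchronized (this) {
                if (guarded && closed) {
                    // close() ran while the file was being created: clean up here instead.
                    created.close();
                    Files.deleteIfExists(statePath);
                    throw new IOException("stream closed while the file was being created");
                }
                out = created; // unguarded: may be published after close(), so nobody deletes it
            }
        }

        /** Normal cleanup path: close the stream and delete its file. */
        @Override
        public synchronized void close() throws IOException {
            if (closed) {
                return;
            }
            closed = true;
            if (out != null) {
                try {
                    out.close();
                } finally {
                    Files.deleteIfExists(statePath);
                }
            }
        }

        /** Safety-net style cleanup: closes a leaked stream but does not delete the file. */
        synchronized void safetyNetClose() throws IOException {
            if (out != null) {
                out.close();
            }
        }
    }

    /** Aborts while the file is still being created; reports whether the file survived. */
    static boolean leavesFileBehind(boolean guarded) {
        try {
            Path dir = Files.createTempDirectory("ckpt-demo");
            Path file = dir.resolve("shared-state");
            LazyStateStream stream = new LazyStateStream(file, guarded);
            CountDownLatch started = new CountDownLatch(1);
            CountDownLatch proceed = new CountDownLatch(1);

            Thread snapshotThread = new Thread(() -> {
                try {
                    stream.createStream(started, proceed);
                } catch (IOException | InterruptedException expectedOnAbort) {
                    // the guarded variant reports the abort here
                }
            });
            snapshotThread.start();
            started.await();     // creation is in flight and `out` is still null
            stream.close();      // abort: sets `closed`, but there is nothing to close or delete yet
            proceed.countDown(); // the slow file creation now completes
            snapshotThread.join();

            boolean survived = Files.exists(file);
            stream.safetyNetClose();
            Files.deleteIfExists(file); // tidy up the demo's temp directory
            Files.deleteIfExists(dir);
            return survived;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded leaves file: " + leavesFileBehind(false));
        System.out.println("guarded leaves file: " + leavesFileBehind(true));
    }
}
```

      The latches make the interleaving deterministic: close() always runs after the closed check in createStream() but before the file exists, which is exactly the window described in the second case.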

      How to reproduce?

      This is hard to reproduce reliably, but you can try setting a shorter checkpoint timeout and increasing the parallelism of the Flink job.

          People

            Assignee: Unassigned
            Reporter: ChangjiGuo