Flink / FLINK-25842 (FLIP-158: Generalized incremental checkpoints) / FLINK-28172

Scatter dstl files into separate directories by job id



    Description

      In the current implementation of FsStateChangelogStorage, dstl files from all jobs are put into the same directory (configured via dstl.dfs.base-path). This is fine on an object store like S3, but on a file system like HDFS it causes problems.

      First, a single directory may have an upper limit on the number of files it can hold, and raising that limit greatly degrades the performance of the distributed file system.

      Second, dstl file management becomes difficult because the user cannot tell which job a dstl file belongs to, especially when retained checkpoints are enabled.

      Proposal

      1. Create a subdirectory named after the job id under the dstl.dfs.base-path directory when the job starts.
      2. Upload all dstl files to that subdirectory.

      (Going a step further, we could even create two levels of subdirectories under the dstl.dfs.base-path directory, like base-path/{jobId}/dstl. This way, if the user configures the same dstl.dfs.base-path as state.checkpoints.dir, all files needed for job recovery end up in the same directory, well organized.)
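      The proposed two-level layout can be sketched as follows. This is a minimal illustration, not the actual Flink implementation; the class name DstlPathResolver and its method are hypothetical, and a real change would go through Flink's own Path handling in FsStateChangelogStorage.

```java
/**
 * Hypothetical sketch of the proposed directory layout: dstl files for a
 * job live under <base-path>/<jobId>/dstl rather than directly under
 * the configured dstl.dfs.base-path. Not part of Flink's API.
 */
public class DstlPathResolver {

    /** Builds the per-job dstl directory path from the configured base path. */
    static String resolveJobDirectory(String basePath, String jobId) {
        // Normalize a trailing slash so the result is always base/<jobId>/dstl.
        String base = basePath.endsWith("/")
                ? basePath.substring(0, basePath.length() - 1)
                : basePath;
        return base + "/" + jobId + "/dstl";
    }

    public static void main(String[] args) {
        // If base-path is shared with state.checkpoints.dir, all recovery
        // files for one job end up grouped under that job's directory.
        System.out.println(resolveJobDirectory(
                "hdfs:///flink/checkpoints", "81f1fc9538d4a7bc5d1a7a1d8d0b6c34"));
        // prints hdfs:///flink/checkpoints/81f1fc9538d4a7bc5d1a7a1d8d0b6c34/dstl
    }
}
```

      With this layout, cleaning up after a job (or identifying which job owns a file when retained checkpoints are enabled) reduces to listing or deleting a single per-job subdirectory.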


            People

              Feifan Wang