Flink / FLINK-32070

FLIP-306 Unified File Merging Mechanism for Checkpoints

Details

    Description

      The FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints

      Creating many checkpoint files can cause a 'file flood' problem, in which a large number of files are written to checkpoint storage in a short time. In large clusters with high workloads this causes several issues. Frequent creation and deletion of files increases the volume of file metadata modifications on the DFS, leading to single-machine hotspots on metadata maintainers (e.g. the NameNode in HDFS). In addition, the performance of object storage (e.g. Amazon S3 and Alibaba OSS) can decrease significantly when listing objects, which is necessary for object name de-duplication before creating an object. This further degrades directory manipulation from the file system's point of view (see the hadoop-aws module documentation, section 'Warning #2: Directories are mimicked').

      Solutions have been proposed for individual types of state files (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel state), but the file flood problems caused by each type of checkpoint file are similar and have so far lacked a systematic view and solution. The goal of this FLIP is therefore to establish a unified file merging mechanism that addresses the file flood problem during checkpoint creation for all types of state files, including keyed, non-keyed, channel, and changelog state. This will significantly improve system stability and the availability of fault tolerance in Flink.
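
      To make the idea concrete, here is a minimal Python sketch of the general technique of merging several logical state files into one physical file. It is not Flink's implementation: `SegmentHandle`, `MergedFileWriter`, and the file name `merged-0` are hypothetical names chosen for illustration, addressing each logical file by offset and length is an assumption of this sketch, and an in-memory buffer stands in for the checkpoint storage.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentHandle:
    """Hypothetical handle: locates one logical state file inside a physical file."""
    physical_file: str
    offset: int
    length: int


class MergedFileWriter:
    """Appends many logical state files to one physical file (in memory here)."""

    def __init__(self, physical_file: str):
        self.physical_file = physical_file
        self._buffer = io.BytesIO()  # stand-in for a file on DFS or object storage

    def write(self, data: bytes) -> SegmentHandle:
        # Record where this logical file starts, then append its bytes.
        offset = self._buffer.tell()
        self._buffer.write(data)
        return SegmentHandle(self.physical_file, offset, len(data))

    def read(self, handle: SegmentHandle) -> bytes:
        # Recover a logical file by slicing the physical file at its segment.
        content = self._buffer.getvalue()
        return content[handle.offset:handle.offset + handle.length]


writer = MergedFileWriter("merged-0")
handles = [writer.write(chunk) for chunk in (b"keyed", b"operator", b"channel")]
```

      In this sketch three logical files share a single physical file, so the storage layer sees one file creation instead of three, at the cost of tracking a segment handle per logical file.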

        Issue Links

          1. Implement the snapshot manager for merged checkpoint files in TM (Sub-task, Closed, Zakelly Lan)
          2. Create and wire FileMergingSnapshotManager with TaskManagerServices (Sub-task, Closed, Yanfei Lei)
          3. Implement file merging in snapshot (Sub-task, Closed, Han Yin)
          4. Delete merged files on checkpoint abort or subsumption (Sub-task, Resolved, Zakelly Lan)
          5. Support file merging across checkpoints (Sub-task, Resolved, Zakelly Lan)
          6. Report State handle of file merging directory to JM (Sub-task, Closed, Yanfei Lei)
          7. Add file pool for concurrent file reusing (Sub-task, Resolved, Hangxiang Yu)
          8. Implement shared state file merging (Sub-task, Closed, Zakelly Lan)
          9. Implement private state file merging (Sub-task, Closed, Yanfei Lei)
          10. Read/write checkpoint metadata of merged files (Sub-task, Resolved, Hangxiang Yu)
          11. Register reused state handles to FileMergingSnapshotManager (Sub-task, Resolved, Zakelly Lan)
          12. Restoration of FileMergingSnapshotManager (Sub-task, In Progress, Jinzhong Li)
          13. Compatibility between file-merging on and off across job runs (Sub-task, Open, Jinzhong Li)
          14. Documentation of checkpoint file-merging (Sub-task, Open, Yanfei Lei)
          15. Chinese translation of documentation of checkpoint file-merging (Sub-task, Open, Hangxiang Yu)
          16. Migrate current file merging of channel state into the file merging framework (Sub-task, In Progress, Yanfei Lei)
          17. Implement and migrate batch uploading in changelog files into the file merging framework (Sub-task, Open, Hangxiang Yu)
          18. Cleanup non-reported managed directory on exit of TM (Sub-task, Open, Zakelly Lan)
          19. Space amplification statistics of file merging (Sub-task, Open, Rui Xia)
          20. Introduce file merging configuration (Sub-task, Resolved, Yanfei Lei)
          21. Re-uploading in state file-merging for space amplification control (Sub-task, Open, Han Yin)
          22. Do fast copy in best-effort during first checkpoint after restoration (Sub-task, Open, Yanfei Lei)
          23. Python API for enabling and configuring file merging snapshot (Sub-task, Open, Yanfei Lei)
          24. Add necessary metrics for file-merging (Sub-task, Open, Hangxiang Yu)
          25. Integrate snapshot file-merging with existing IT cases (Sub-task, Open, Rui Xia)

            People

              Zakelly Lan