Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-11937

Resolve small file problem in RocksDB incremental checkpoint

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      Currently when incremental checkpoint is enabled in RocksDBStateBackend a separate file will be generated on DFS for each sst file. This may cause “file flood” when running intensive workload (many jobs with high parallelism) in big cluster. According to our observation in Alibaba production, such file flood introduces at lease two drawbacks when using HDFS as the checkpoint storage FileSystem: 1) huge number of RPC request issued to NN which may burst its response queue; 2) huge number of files causes big pressure on NN’s on-heap memory.

      In Flink we ever noticed similar small file flood problem and tried to resolved it by introducing ByteStreamStateHandle(FLINK-2808), but this solution has its limitation that if we configure the threshold too low there will still be too many small files, while if too high the JM will finally OOM, thus could hardly resolve the issue in case of using RocksDBStateBackend with incremental snapshot strategy.

      We propose a new OutputStream called FileSegmentCheckpointStateOutputStream(FSCSOS) to fix the problem. FSCSOS will reuse the same underlying distributed file until its size exceeds a preset threshold. We
      plan to complete the work in 3 steps: firstly introduce FSCSOS, secondly resolve the specific storage amplification issue on FSCSOS, and lastly add an option to reuse FSCSOS across multiple checkpoints to further reduce the DFS file number.

      More details please refer to the attached design doc.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            klion26 Congxian Qiu

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 0.5h
                0.5h

                Slack

                  Issue deployment