Currently when incremental checkpoint is enabled in RocksDBStateBackend a separate file will be generated on DFS for each sst file. This may cause “file flood” when running intensive workload (many jobs with high parallelism) in big cluster. According to our observation in Alibaba production, such file flood introduces at lease two drawbacks when using HDFS as the checkpoint storage FileSystem: 1) huge number of RPC request issued to NN which may burst its response queue; 2) huge number of files causes big pressure on NN’s on-heap memory.
In Flink we ever noticed similar small file flood problem and tried to resolved it by introducing ByteStreamStateHandle(
FLINK-2808), but this solution has its limitation that if we configure the threshold too low there will still be too many small files, while if too high the JM will finally OOM, thus could hardly resolve the issue in case of using RocksDBStateBackend with incremental snapshot strategy.
We propose a new OutputStream called FileSegmentCheckpointStateOutputStream(FSCSOS) to fix the problem. FSCSOS will reuse the same underlying distributed file until its size exceeds a preset threshold. We
plan to complete the work in 3 steps: firstly introduce FSCSOS, secondly resolve the specific storage amplification issue on FSCSOS, and lastly add an option to reuse FSCSOS across multiple checkpoints to further reduce the DFS file number.
More details please refer to the attached design doc.