FLINK-21726

Fix checkpoint stuck

Details

    Description

      1. Bug description:

      When a RocksDB checkpoint is taken, it may get stuck in the `WaitUntilFlushWouldNotStallWrites` method.

      2. Brief analysis of the cause:

      2.1 Configuration parameters:

       

      # Flink yaml:
      state.backend.rocksdb.predefined-options: SPINNING_DISK_OPTIMIZED_HIGH_MEM
      state.backend.rocksdb.compaction.style: UNIVERSAL
      
      
      # corresponding RocksDB config
      Compaction Style : Universal 
      
      max_write_buffer_number : 4
      min_write_buffer_number_to_merge : 3

      A checkpoint is usually very fast. When it runs, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable MemTables, which is fewer than `min_write_buffer_number_to_merge` (3), they will not be flushed. However, execution still reaches the following check, which reports a delayed write stall:

       

      // method: GetWriteStallConditionAndCause
      if (mutable_cf_options.max_write_buffer_number > 3 &&
          num_unflushed_memtables >=
              mutable_cf_options.max_write_buffer_number - 1) {
        return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
      }
      

      code link: https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/column_family.cc#L847

      The checkpoint assumes that a flush job is in progress and waits for it to finish, but no flush job has been scheduled, so it waits forever.
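
The arithmetic behind the stall can be checked with a small standalone model. This is a minimal sketch, not RocksDB code: `WouldDelayOnMemtables` and `CheckpointSeesStall` are hypothetical helpers that mirror only the memtable branch quoted above, and the `+ 1` is an assumption that the checkpoint path asks whether one additional immutable MemTable would cause a stall.

```cpp
// Simplified model of the memtable branch of GetWriteStallConditionAndCause.
// Hypothetical helper for illustration only; not part of RocksDB.
bool WouldDelayOnMemtables(int max_write_buffer_number,
                           int num_unflushed_memtables) {
  return max_write_buffer_number > 3 &&
         num_unflushed_memtables >= max_write_buffer_number - 1;
}

// Assumption: the checkpoint path evaluates the condition as if its own
// flush had already added one more immutable MemTable.
bool CheckpointSeesStall(int max_write_buffer_number,
                         int immutable_memtables) {
  return WouldDelayOnMemtables(max_write_buffer_number,
                               immutable_memtables + 1);
}
```

With the configuration from this issue, `CheckpointSeesStall(4, 2)` evaluates `2 + 1 >= 4 - 1` and reports a stall, even though 2 immutable MemTables are below `min_write_buffer_number_to_merge` (3) and therefore never flushed.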

      2.2 Solution:

      Add a restriction: wait only when the number of immutable MemTables >= `min_write_buffer_number_to_merge`.

      The RocksDB community has fixed this bug: https://github.com/facebook/rocksdb/pull/7921
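
The effect of the proposed restriction can be sketched as follows. This is an illustrative model under stated assumptions, not the code from the linked pull request: `FlushState`, `ShouldWaitBefore`, and `ShouldWaitAfter` are hypothetical names, and only the memtable condition is modeled.

```cpp
// Hypothetical snapshot of the values relevant to the wait decision.
struct FlushState {
  int immutable_memtables;
  int min_write_buffer_number_to_merge;
  int max_write_buffer_number;
};

// Behavior described in this issue: wait whenever one more immutable
// MemTable would hit the memtable-limit stall condition.
bool ShouldWaitBefore(const FlushState& s) {
  return s.max_write_buffer_number > 3 &&
         s.immutable_memtables + 1 >= s.max_write_buffer_number - 1;
}

// With the added restriction: wait only if enough immutable MemTables
// exist for a flush to actually be scheduled.
bool ShouldWaitAfter(const FlushState& s) {
  return ShouldWaitBefore(s) &&
         s.immutable_memtables >= s.min_write_buffer_number_to_merge;
}
```

For the state `{2, 3, 4}` from this issue, `ShouldWaitBefore` returns true while `ShouldWaitAfter` returns false, so the checkpoint would no longer wait for a flush that cannot happen.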

      2.3 Code that can reproduce the bug:

      https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java

      3. Interesting point

      This bug is triggered only when the number of sorted runs >= `level0_file_num_compaction_trigger`.

      Otherwise, the following check in `WaitUntilFlushWouldNotStallWrites` breaks out of the wait loop:

      if (cfd->imm()->NumNotFlushed() <
              cfd->ioptions()->min_write_buffer_number_to_merge &&
          vstorage->l0_delay_trigger_count() <
              mutable_cf_options.level0_file_num_compaction_trigger) {
        break;
      }
      

      code link: https://github.com/facebook/rocksdb/blob/fbed72f03c3d9e4fdca3e5993587ef2559ba6ab9/db/db_impl/db_impl_compaction_flush.cc#L1974

      With Universal compaction, `l0_delay_trigger_count()` can reach or exceed `level0_file_num_compaction_trigger`. In that case the break is skipped and the bug is triggered.
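
To make the trigger condition concrete, the early-exit check can be modeled as a pure function. This is a minimal sketch: `EarlyExit` is a hypothetical helper, and the trigger value of 4 used in the usage note is an illustrative assumption, not a value taken from this issue.

```cpp
// Model of the early-exit check in WaitUntilFlushWouldNotStallWrites.
// Hypothetical helper for illustration only; not part of RocksDB.
// Returns true when the wait loop would break out immediately.
bool EarlyExit(int num_not_flushed,
               int min_write_buffer_number_to_merge,
               int l0_delay_trigger_count,
               int level0_file_num_compaction_trigger) {
  return num_not_flushed < min_write_buffer_number_to_merge &&
         l0_delay_trigger_count < level0_file_num_compaction_trigger;
}
```

With 2 unflushed MemTables, `min_write_buffer_number_to_merge` = 3, and an assumed trigger of 4, `EarlyExit(2, 3, 1, 4)` is true and the checkpoint proceeds. `EarlyExit(2, 3, 4, 4)` is false, so the loop continues and the stuck state described above can occur.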

            People

              Assignee: Unassigned
              Reporter: Rui Fan (fanrui)