We aim to reduce the state configuration complexity by applying a default configuration with robust settings, based on lessons learned in Flink.
(1) Always use RocksDB.
That is already the case.
We keep this for now, as long as the only other alternative are backends with Objects on the heap, which are tricky in terms of predictable JVM performance. RocksDB has a significant performance cost, but more robust behavior.
(2) Activate local recovery by default.
That makes recovery cheao for soft tasks failures and gracefully cancelled tasks.
We need to set these options:
- state.backend.local-recovery: true
- taskmanager.state.local.root-dirs: <dir> - some local directory that will not possibly be wiped by the OS periodically, so typically some local directory that is not /tmp, for example /local/state/recovery.
- state.backend.rocksdb.localdir: <dir> - a directory on the same FS / device as above, so that one can create hard links between them (required for RocksDB local checkpoints), for example /local/state/rocksdb.
Flink will most likely adopt this as a default setting as well in the future.
It still makes sense to pre-configer a different RocksDB working directory than /tmp.
(3) Activate partitioned indexes by default.
This may cost minimal performance in some cases, but can avoid massive performance regression in cases where the index blocks no longer fit into the memory cache (may happen more frequently when there are too many ColumnFamilies = states).
Set state.backend.rocksdb.memory.partitioned-index-filters: true.
FLINK-20496 for details.
(4) Increase number of transfer threads by default.
This speeds up state recovery in many cases. The default value in Flink is a bit conservative, to avoid spamming DFS (like HDFS) by default. The more cloud-centric StateFun setups should be safe to use higher default value.
Set state.backend.rocksdb.checkpoint.transfer.thread.num: 8.
(5) Increase RocksDB compaction threads by default.
The number of RocksDB compaction threads is frequently a bottleneck.
Increasing it costs virtually nothing and mitigates that bottleneck in most cases.
(this value is chosen under the assumption that there is only one slot per TM in StateFun).