Priority: Not a Priority
Affects Version/s: 1.8.0, 1.8.1
Fix Version/s: None
So far this is only reliably reproducible on a 4-node cluster with parallelism ≥ 100, but do try https://github.com/jcaesar/flink-rocksdb-file-leak
A RocksDB state backend with HDFS checkpoints, with or without local recovery, may leak files in io.tmp.dirs when a checkpoint expires because it timed out.
If a checkpoint grows larger than what can be transferred within one checkpoint timeout, every subsequent checkpoint will fail as well. Combined with a quick rollover of SST files (e.g. due to a high density of writes), this can quickly exhaust the available disk space (or memory, since /tmp is the default location).
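For concreteness, a minimal sketch of the kind of setup I mean (Flink 1.8 API; the HDFS path, intervals, parallelism, and the workload are placeholders, not taken from the repo linked above):

{code:java}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class RocksDbFileLeakSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(100); // so far only reproduced with parallelism >= 100

        // RocksDB state backend with checkpoints on HDFS; the boolean enables
        // incremental checkpointing. The leak shows up with or without
        // state.backend.local-recovery.
        env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints", true));

        // Checkpoint every 10s, expire after 60s. Once a checkpoint outgrows
        // what can be transferred in 60s, every checkpoint times out, and each
        // expiry can strand files in io.tmp.dirs.
        env.enableCheckpointing(10_000);
        env.getCheckpointConfig().setCheckpointTimeout(60_000);

        // Stand-in workload: a high density of writes into keyed state, so
        // RocksDB rolls over SST files quickly.
        env.generateSequence(0, Long.MAX_VALUE)
            .keyBy(v -> v % 1_000_000)
            .map(new RichMapFunction<Long, Long>() {
                private transient ValueState<Long> last;

                @Override
                public void open(Configuration parameters) {
                    last = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("last", Long.class));
                }

                @Override
                public Long map(Long value) throws Exception {
                    last.update(value); // every record is a state write
                    return value;
                }
            })
            .addSink(new DiscardingSink<>());

        env.execute("rocksdb-file-leak repro sketch");
    }
}
{code}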
As a workaround, the jobmanager's REST API can be polled for failed checkpoints, and the associated files deleted accordingly.
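A sketch of such a watchdog, using the standard GET /jobs/<jobid>/checkpoints statistics endpoint; the host/port, the regex-based id/status extraction (a real script should use a JSON parser), and the mapping from checkpoint IDs to leftover files are simplifications, so the deletion itself is only hinted at:

{code:java}
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Usage: java FailedCheckpointPoller <jobid>
public class FailedCheckpointPoller {
    // Crude extraction of failed entries from the "history" array; assumes
    // "id" precedes "status" within one entry.
    private static final Pattern FAILED =
            Pattern.compile("\"id\"\\s*:\\s*(\\d+)[^{}]*\"status\"\\s*:\\s*\"FAILED\"");

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://jobmanager:8081/jobs/" + args[0] + "/checkpoints");
        try (InputStream in = url.openStream();
             Scanner body = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            Matcher m = FAILED.matcher(body.hasNext() ? body.next() : "");
            while (m.find()) {
                // A wrapper script can now delete whatever this checkpoint
                // left behind under io.tmp.dirs on each node.
                System.out.println("failed checkpoint: " + m.group(1));
            }
        }
    }
}
{code}

Run it periodically (e.g. from cron) against every job and clean up the leftovers matching the reported IDs.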
I've tried investigating the cause a little, but I'm stuck (a toy model of how I read this chain follows the list):
- "Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before completing." and similar messages are logged, so
- abortExpired is invoked, so
- dispose is invoked, so
- cancelCaller is invoked, so
- the canceler is invoked (through one more layer), so
- cleanup is invoked (possibly not from cancel), so
- cleanupProvidedResources is invoked (this is the indirection that made me give up), so
- this trace log should be printed, but it isn't.
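For anyone trying to follow along, here is a toy model of how I read the chain above; the names mirror the Flink methods, but the structure is my paraphrase, not actual Flink 1.8 code:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the abort path, paraphrased; not Flink code.
public class AbortPathModel {

    abstract static class SnapshotTask {
        private final AtomicBoolean cleaned = new AtomicBoolean(false);

        // CheckpointCoordinator: "Checkpoint ... expired before completing."
        void abortExpired() { dispose(); }

        // PendingCheckpoint.dispose cancels the per-subtask snapshot callables
        void dispose() { cancelCaller(); }

        void cancelCaller() { canceler().run(); }

        // the "one more layer": the canceler handed around as a Runnable
        Runnable canceler() { return this::cancel; }

        void cancel() {
            // Whichever of cancel() / normal completion comes first should run
            // cleanup exactly once. If this guard (or an exception on the way
            // here) swallows the call, cleanupProvidedResources() never runs
            // and the temp files in io.tmp.dirs are stranded.
            if (cleaned.compareAndSet(false, true)) {
                cleanup();
            }
        }

        void cleanup() { cleanupProvidedResources(); }

        // In the real code this deletes the temp files and logs at TRACE;
        // that trace line is the one I never see.
        abstract void cleanupProvidedResources();
    }

    public static void main(String[] args) {
        SnapshotTask task = new SnapshotTask() {
            @Override
            void cleanupProvidedResources() {
                System.out.println("cleanupProvidedResources() reached");
            }
        };
        task.abortExpired(); // in this model, the message prints exactly once
    }
}
{code}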
I have some time to investigate this further, but I'd appreciate help figuring out where in this chain things go wrong.