Flink / FLINK-13808

Checkpoints expired by timeout may leak RocksDB files

      Description

      A RocksDB state backend with HDFS checkpoints, with or without local recovery, may leak files in io.tmp.dirs on checkpoint expiry by timeout.

      If the size of a checkpoint exceeds what can be transferred within one checkpoint timeout, checkpoints will keep failing indefinitely. Combined with a quick rollover of SST files (e.g. due to a high density of writes), the leaked files may quickly exhaust available disk space (or memory, since the default location is /tmp, which may be memory-backed).

      As a workaround, the JobManager's REST API can be polled regularly for failed checkpoints, and the files associated with them deleted.
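
      The workaround could be scripted roughly as follows. This is only a minimal sketch, not a tested cleanup tool: it reads the checkpoint history from the JobManager's `/jobs/<job-id>/checkpoints` REST endpoint, collects the IDs of entries with status `FAILED`, and removes matching directories below the temporary directory. It assumes that the leaked files sit in directories named `chk-<checkpoint id>`; that naming, and all function names below, are assumptions that should be checked against the actual layout of io.tmp.dirs (particularly with local recovery enabled) before deleting anything.

```python
import json
import shutil
import urllib.request
from pathlib import Path


def fetch_checkpoint_stats(rest_url, job_id):
    """GET /jobs/<job_id>/checkpoints from the JobManager and parse the JSON body."""
    with urllib.request.urlopen(f"{rest_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def failed_checkpoint_ids(stats):
    """Collect the IDs of history entries whose status is FAILED."""
    return {
        entry["id"]
        for entry in stats.get("history", [])
        if entry.get("status") == "FAILED"
    }


def stale_checkpoint_dirs(tmp_dir, failed_ids):
    """Find directories below tmp_dir that belong to failed checkpoints.

    ASSUMPTION: leaked files live in directories named chk-<checkpoint id>.
    Verify this against the real contents of io.tmp.dirs first.
    """
    wanted = {f"chk-{cid}" for cid in failed_ids}
    return sorted(
        p for p in Path(tmp_dir).rglob("chk-*") if p.is_dir() and p.name in wanted
    )


def remove_dirs(dirs):
    """Delete the given directories, ignoring ones that vanished meanwhile."""
    for d in dirs:
        shutil.rmtree(d, ignore_errors=True)
```

      A periodic job on each TaskManager host could then call `remove_dirs(stale_checkpoint_dirs(tmp_dir, failed_checkpoint_ids(fetch_checkpoint_stats(rest_url, job_id))))`, with `rest_url`, `job_id` and `tmp_dir` supplied by the operator. The REST history retains only a bounded number of recent checkpoints, so the polling interval has to be short enough not to miss failures.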

      I've tried investigating the cause a little bit, but I'm stuck:

      I have some time to further investigate, but I'd appreciate help on finding out where in this chain things go wrong.

              People

              • Assignee: Unassigned
              • Reporter: Caesar Julius Michaelis
              • Votes: 0
              • Watchers: 8