  Kafka / KAFKA-4317

RocksDB checkpoint files lost on kill -9


      Description

      Right now, the checkpoint files for logged RocksDB stores are written during a graceful shutdown, and removed upon restoration. Unfortunately this means that in a scenario where the process is forcibly killed, the checkpoint files are not there, so all RocksDB stores are rematerialized from scratch on the next launch.

      In a way, this is good, because it simulates bootstrapping a new node (for example, it's a good way to see how much I/O is used to rematerialize the stores); however, it leads to longer recovery times when a non-graceful shutdown occurs and we want to get the job up and running again.

      There seem to be two possible approaches to consider:

      • Simply do not remove the checkpoint files on restoration. This way, a kill -9 only results in repeating the restoration of the data generated in the source topics since the last graceful shutdown.
      • Continually update the checkpoint files (perhaps on commit). This would result in the least restart overhead/latency, but the additional complexity may not be worth it (see the sketch after the KIP link below).

      https://cwiki.apache.org/confluence/display/KAFKA/KIP-116%3A+Add+State+Store+Checkpoint+Interval+Configuration
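      As a rough illustration of the second option, here is a minimal sketch of commit-time checkpointing, written independently of any Streams internals. The class name CommitTimeCheckpoint, the methods writeOnCommit/readOnStartup, the .checkpoint file name, and the space-separated file format are all hypothetical choices for this sketch, not the Kafka Streams API; only org.apache.kafka.common.TopicPartition is an actual public class.

{code:java}
import org.apache.kafka.common.TopicPartition;

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: persist the changelog offsets for a task's RocksDB
// stores on every commit, so that after a kill -9 the next launch only
// replays records appended since the last commit instead of rebuilding the
// stores from the beginning of the changelog.
public class CommitTimeCheckpoint {

    private final Path checkpointFile;  // e.g. <state.dir>/<task>/.checkpoint (hypothetical layout)
    private final Path tmpFile;

    public CommitTimeCheckpoint(final Path taskStateDir) {
        this.checkpointFile = taskStateDir.resolve(".checkpoint");
        this.tmpFile = taskStateDir.resolve(".checkpoint.tmp");
    }

    // Called from the commit path with the offsets that have been flushed to RocksDB.
    // Write to a temp file and atomically rename, so a crash mid-write never
    // leaves a truncated checkpoint behind.
    public synchronized void writeOnCommit(final Map<TopicPartition, Long> offsets) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(tmpFile, StandardCharsets.UTF_8)) {
            for (final Map.Entry<TopicPartition, Long> entry : offsets.entrySet()) {
                final TopicPartition tp = entry.getKey();
                writer.write(tp.topic() + " " + tp.partition() + " " + entry.getValue());
                writer.newLine();
            }
        }
        Files.move(tmpFile, checkpointFile,
                   StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }

    // Read the offsets back on startup; an absent file means "restore from scratch".
    public synchronized Map<TopicPartition, Long> readOnStartup() throws IOException {
        final Map<TopicPartition, Long> offsets = new HashMap<>();
        if (!Files.exists(checkpointFile)) {
            return offsets;
        }
        for (final String line : Files.readAllLines(checkpointFile, StandardCharsets.UTF_8)) {
            final String[] parts = line.trim().split(" ");
            if (parts.length == 3) {
                offsets.put(new TopicPartition(parts[0], Integer.parseInt(parts[1])),
                            Long.parseLong(parts[2]));
            }
        }
        return offsets;
    }
}
{code}

      A restoring task could then seek each changelog partition to the checkpointed offset instead of the earliest offset; making the checkpoint interval configurable is essentially what the KIP linked above proposes.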


            People

            • Assignee: Damian Guy (damianguy)
            • Reporter: Greg Fodor (gfodor)
            • Votes: 0
            • Watchers: 4
