  1. Spark
  2. SPARK-43421

Implement changelog checkpointing for RocksDB state store

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      We have identified state checkpointing latency as one of the major performance bottlenecks for stateful streaming queries. Currently, the RocksDB state store pauses the RocksDB instances to upload a snapshot to the cloud when committing a batch, which is heavyweight and has unpredictable performance.

      In order to reduce the checkpoint duration and end-to-end latency, we propose to:
      1. During state commit, make the state of a micro-batch durable by syncing the changelog, instead of the state snapshot, to the checkpoint directory.
      2. Upload snapshots in the background to enable changelog purging and faster failure recovery.

      In this way, we allow the RocksDB instance to run without interruption, which improves RocksDB operation performance. This also dramatically reduces the commit time and batch duration, because we upload a smaller amount of data during state commit. With this change, stateful queries that use the RocksDB state store will have lower and more predictable latency.
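
The two proposed steps can be illustrated with a small in-memory model. This is only a sketch of the general idea, not Spark's implementation: the class `ToyChangelogStore`, its methods, and the dictionary standing in for the checkpoint directory are all hypothetical names invented for this example. It assumes snapshots are taken between batches and that recovery always targets a version at or after the latest snapshot.

```python
import copy


class ToyChangelogStore:
    """Toy model of changelog checkpointing (hypothetical, not Spark's code)."""

    def __init__(self, checkpoint=None):
        # `checkpoint` stands in for the remote checkpoint directory.
        if checkpoint is None:
            checkpoint = {"changelogs": {}, "snapshots": {}}
        self.checkpoint = checkpoint
        self.state = {}      # local state (RocksDB in the real system)
        self.pending = []    # writes made during the current micro-batch
        self.version = 0

    def put(self, key, value):
        self.state[key] = value
        self.pending.append(("put", key, value))

    def remove(self, key):
        self.state.pop(key, None)
        self.pending.append(("remove", key, None))

    def commit(self):
        """Step 1: make the batch durable by persisting only its changelog."""
        self.version += 1
        self.checkpoint["changelogs"][self.version] = list(self.pending)
        self.pending = []
        return self.version

    def upload_snapshot(self):
        """Step 2: upload a full snapshot, then purge covered changelogs.

        Called synchronously here; the proposal runs this in the background.
        """
        if self.pending:
            raise RuntimeError("snapshot only between batches in this toy")
        self.checkpoint["snapshots"][self.version] = copy.deepcopy(self.state)
        for v in [v for v in self.checkpoint["changelogs"] if v <= self.version]:
            del self.checkpoint["changelogs"][v]

    @classmethod
    def recover(cls, checkpoint, version):
        """Load the latest snapshot <= version, then replay newer changelogs."""
        store = cls(checkpoint)
        base = max((v for v in checkpoint["snapshots"] if v <= version), default=0)
        store.state = copy.deepcopy(checkpoint["snapshots"].get(base, {}))
        for v in range(base + 1, version + 1):
            for op, key, value in checkpoint["changelogs"][v]:
                if op == "put":
                    store.state[key] = value
                else:
                    store.state.pop(key, None)
        store.version = version
        return store


store = ToyChangelogStore()
store.put("a", 1)
store.commit()             # version 1: only the changelog is written
store.put("b", 2)
store.commit()             # version 2
store.upload_snapshot()    # snapshot at version 2; changelogs 1 and 2 purged
store.put("a", 10)
store.remove("b")
store.commit()             # version 3

recovered = ToyChangelogStore.recover(store.checkpoint, version=3)
print(recovered.state)     # {'a': 10}
```

In this model each commit writes only the operations of one batch, while the full copy of the state is written separately. That separation is what lets the commit path stay small and lets a snapshot shorten recovery by bounding how many changelogs must be replayed.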

          People

            Assignee: Chaoqin Li
            Reporter: Chaoqin Li