Spark / SPARK-45415

RocksDB consumes excessive disk space when many concurrent streaming queries are using dedup


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.3.2
    • Fix Version/s: 4.0.0
    • Component/s: Structured Streaming
    • Environment: Apache Spark on AWS EMR; local Spark on a laptop running Linux. Does not impact macOS due to missing support in RocksDB for pre-allocation (macOS does not support the fallocate system call).

    Description

      Our Spark environment runs a number of parallel structured streaming jobs, many of which use the state store. Most use it for dropDuplicates and hold only a tiny amount of state, but a few have a substantially larger state store that requires RocksDB. In this configuration, Spark allocates a minimum of spark.sql.shuffle.partitions * queryCount state store partitions, each of which pre-allocates about 74 MB (observed on EMR/Hadoop) of disk storage for RocksDB. The space comes from pre-allocation of log file space using fallocate, forcing users either to unnaturally reduce shuffle partitions, split work across separate Spark instances, or waste a large amount of storage.
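      To illustrate the scale of the problem, here is a back-of-the-envelope sketch. The partition and query counts are illustrative assumptions (200 is Spark's default for spark.sql.shuffle.partitions; 10 concurrent queries is hypothetical); only the ~74 MB per-partition figure comes from the report above.

      ```python
      # Sketch of the minimum disk footprint described in this issue.
      SHUFFLE_PARTITIONS = 200   # spark.sql.shuffle.partitions (Spark's default)
      CONCURRENT_QUERIES = 10    # parallel streaming queries (assumed for illustration)
      PREALLOC_MB = 74           # observed fallocate pre-allocation per RocksDB store

      # One RocksDB state store instance per (partition, query) pair.
      total_mb = SHUFFLE_PARTITIONS * CONCURRENT_QUERIES * PREALLOC_MB
      print(f"pre-allocated: {total_mb} MB (~{total_mb / 1024:.1f} GiB)")
      # → pre-allocated: 148000 MB (~144.5 GiB)
      ```

      Under these assumptions, roughly 144 GiB of disk is pre-allocated before any state is actually written, which is why reducing shuffle partitions or splitting Spark instances were the only workarounds available.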

      Attachments

        Issue Links

          Activity

            People

              Assignee: Scott Schenkein (schenksj)
              Reporter: Scott Schenkein (schenksj)
              Votes: 0
              Watchers: 2


                Time Tracking

                  Original Estimate: 4h
                  Remaining Estimate: 4h
                  Time Spent: Not Specified