[SPARK-45415] RocksDB consumes excessive disk space when many concurrent streaming queries are using dedup - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.3.2
Fix Version/s: 4.0.0
Component/s: Structured Streaming
Labels:
- pull-request-available
Environment:

Apache Spark on AWS EMR, and local Spark on a Linux laptop. macOS is not affected, because RocksDB lacks pre-allocation support there (macOS does not support the fallocate system call).

Description

Our Spark environment runs a number of parallel Structured Streaming jobs, many of which use the state store. Most use it for dropDuplicates and hold only a tiny amount of state, but a few have a state store large enough to require RocksDB. In such a configuration, Spark allocates a minimum of spark.sql.shuffle.partitions * queryCount partitions, each of which pre-allocates about 74 MB of disk storage for RocksDB (observed on EMR/Hadoop). The cause is RocksDB pre-allocating space for its log files using fallocate. Users are forced to either unnaturally reduce shuffle partitions, split the queries across separate Spark instances, or provision a large amount of storage that goes to waste.
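
To make the scale concrete, the minimum pre-allocated disk implied by the formula above can be estimated with a short calculation. This is an illustrative sketch only: the function name and the query count of 10 are hypothetical, 200 is Spark's default for spark.sql.shuffle.partitions, and 74 MB is the figure observed in this report, which may differ in other environments.

```python
# Rough lower-bound estimate of disk pre-allocated by RocksDB state stores.
# Assumptions: 74 MB per partition is the value observed in this report
# (EMR/Hadoop); the query count used below is hypothetical.
PREALLOC_MB_PER_PARTITION = 74


def estimate_prealloc_mb(shuffle_partitions: int,
                         query_count: int,
                         mb_per_partition: int = PREALLOC_MB_PER_PARTITION) -> int:
    """Return shuffle_partitions * query_count * mb_per_partition, in MB."""
    if shuffle_partitions < 0 or query_count < 0 or mb_per_partition < 0:
        raise ValueError("all inputs must be non-negative")
    return shuffle_partitions * query_count * mb_per_partition


if __name__ == "__main__":
    # 200 is Spark's default for spark.sql.shuffle.partitions;
    # 20 illustrates the "reduce shuffle partitions" workaround.
    for partitions in (200, 20):
        total_mb = estimate_prealloc_mb(partitions, query_count=10)
        print(f"{partitions} partitions x 10 queries -> {total_mb} MB")
    # 200 partitions x 10 queries -> 148000 MB
    # 20 partitions x 10 queries -> 14800 MB
```

Under these assumptions, ten small deduplication queries at the default partition count would reserve roughly 148 GB before storing any meaningful state, which is why the workarounds listed above are unattractive.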

Issue Links

links to

GitHub Pull Request #43202

People

Assignee: Scott Schenkein

Reporter: Scott Schenkein

Dates

Created: 04/Oct/23 13:44

Updated: 11/Oct/23 23:46

Resolved: 11/Oct/23 23:46
