Details
- Improvement
- Status: Resolved
- Major
- Resolution: Auto Closed
- 1.1.0
- None
Description
In Spark, an RDD can be cached in memory for later use. The memory available for caching is "spark.executor.memory * spark.storage.memoryFraction" in Spark versions before 1.1.0, and "spark.executor.memory * spark.storage.memoryFraction * spark.storage.safetyFraction" after SPARK-1777.
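To make the two formulas concrete, here is a small sketch that evaluates both. The function name and the sample values (an 8 GiB executor, a memory fraction of 0.6, a safety fraction of 0.9) are assumptions chosen for illustration and are not taken from this issue.

```python
GIB = 1024 ** 3


def storage_memory_bytes(executor_memory_bytes, memory_fraction, safety_fraction=None):
    """Evaluate the cache-size formulas quoted in the description.

    Passing safety_fraction=None models the formula used before SPARK-1777;
    passing a value models the formula used after it.
    """
    capacity = executor_memory_bytes * memory_fraction
    if safety_fraction is not None:
        capacity *= safety_fraction
    return int(capacity)


# Illustrative values only (assumed, not taken from the issue).
before = storage_memory_bytes(8 * GIB, 0.6)
after = storage_memory_bytes(8 * GIB, 0.6, 0.9)
print(before, after)
```

With the same executor memory and memory fraction, the safety fraction shrinks the cache, so blocks start being dropped to disk at a lower memory usage than before.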
For storage level MEMORY_AND_DISK, when there is not enough free memory to cache new blocks, old blocks may be dropped to disk to free up memory for them. This is handled by ensureFreeSpace in MemoryStore.scala, and the caller always holds an "accountingLock" so that only one thread drops blocks at a time. As a result, this method cannot fully use the available disk throughput when the worker node has multiple disks. When testing our workload, we found this to be a real bottleneck when the total size of the old blocks to be dropped is large.
We have tested a parallel approach on Spark 1.0, and the speedup is significant, so the block-dropping operation should be made parallel.
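The idea can be sketched roughly as follows. This is a toy model in Python, not Spark's MemoryStore code and not the patch tested above: the class ToyMemoryStore, its methods, and the directory layout are all hypothetical. It assumes victim selection and memory accounting stay under a single lock (standing in for accountingLock), while the disk writes run outside the lock on a thread pool so that several local directories can be written at once.

```python
import os
import tempfile
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


class ToyMemoryStore:
    """Hypothetical in-memory block store that spills old blocks to local dirs."""

    def __init__(self, capacity, disk_dirs):
        self.capacity = capacity
        self.disk_dirs = disk_dirs
        self.blocks = {}  # block_id -> bytes; dict order is insertion order (oldest first)
        self.used = 0
        self.lock = threading.Lock()  # stands in for accountingLock

    def put(self, block_id, data, workers=4):
        """Cache a block, dropping the oldest blocks to disk if needed.

        Returns the paths of the blocks that were written to disk.
        """
        if len(data) > self.capacity:
            raise ValueError("block is larger than the whole store")
        with self.lock:
            # Under the lock: only pick victims and update the accounting.
            victims = []
            while self.used + len(data) > self.capacity:
                old_id = next(iter(self.blocks))
                old = self.blocks.pop(old_id)
                self.used -= len(old)
                victims.append((old_id, old))
            self.blocks[block_id] = data
            self.used += len(data)
        if not victims:
            return []
        # Outside the lock: write the victims concurrently.
        # A real store would also have to keep each victim readable
        # until its write has finished; this toy skips that.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self._write_to_disk, victims))

    def _write_to_disk(self, victim):
        block_id, data = victim
        # Spread blocks over the directories with a stable hash of the id.
        index = zlib.crc32(block_id.encode()) % len(self.disk_dirs)
        path = os.path.join(self.disk_dirs[index], block_id)
        with open(path, "wb") as f:
            f.write(data)
        return path


# Usage: two temp dirs stand in for two physical disks.
dirs = [tempfile.mkdtemp(prefix="disk%d-" % i) for i in range(2)]
store = ToyMemoryStore(capacity=100, disk_dirs=dirs)
for i in range(4):
    store.put("rdd_0_%d" % i, bytes(25))
dropped = store.put("rdd_0_4", bytes(60))
```

Keeping the lock scope limited to accounting is the design choice that matters here: the slow part (disk I/O) no longer serializes behind one thread, which is where multiple disks can help.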
Attachments
Issue Links
- is duplicated by
- SPARK-1888 enhance MEMORY_AND_DISK mode by dropping blocks in parallel
- Resolved
- links to