[SPARK-14289] Support multiple eviction strategies for cached RDD partitions - ASF JIRA

Rank to Top

Rank to Bottom

Attach files

Attach Screenshot

Bulk Copy Attachments

Bulk Move Attachments

Voters

Watch issue

Watchers

Create sub-task

Convert to sub-task

Link

Clone

Labels

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Block Manager, Spark Core
Labels:
None
Environment:

Spark 2.0-SNAPSHOT
Single Rack
Standalone mode scheduling
8 node cluster
16 cores & 64G RAM / node
Data Replication factor of 3

Each Node has 1 Spark executors configured with 16 cores each and 40GB of RAM.

Description

Currently, there is only eviction strategy for cached RDD partition in Spark.

The default RDD eviction strategy is LRU (with an additional rule that do not replacing another block that belongs to the same RDD like current creating partition).

When memory space not sufficient for RDD caching, several partitions will be evicted, if these partitions are used again latterly, they will be reproduced by the Lineage information and cached in memory again. The reproduce phase will bring in additional cost. However, LRU has no guarantee for the lowest reproduce cost.

The first RDD that needed to be cached is usually generated by reading from HDFS and doing several transformations. The reading operation usually cost longer time than other Spark transformations.

For example, in one stage we having the following DAG structure: hdfs -> [A] -> B -> [C] -> D - > [E] -> [F], RDD A, C, E, F needed to be cached in memory, F is creating during this stage while A, B and E had already been created in previous. When using the LRU eviction strategy, partition of A will be evicted first. However, the time cost in\ [A] -> B -> [C] may be much less than hdfs ->\ [A], so evict [C] may be better than evict [A].

A eviction strategy based on the creation cost may be better than LRU, by statisticing each transformation's time during the creation of cached RDD partition (e.g. [E] only need to statistic time cost in [C] -> D and D -> [E]) and time cost in needed shuffle reading. When memory for RDD storage not sufficient, partition with the least creation cost may be evicted first. So this strategy for be called as LCS. My current demo show better performance gain than default LRU.

This strategy needs to consider the following situation:
1. Unified Memory Management is provided after Spark 1.6, memory for execution during recomputing a partition may be pretty different than the first time the partition created. So before better thought, LCS may not be allowed in UMM mode. (Though my demo also show improvement in LCS than LRU in UMM mode).

2. MEMORY_AND_DISK_SER or other similar storage level may serialize RDD partition. By estimating ser/deserialize cost and compare to creation cost, if the ser/deserialize cost even larger than recreation, not serialize but directly removed from memory. As existing storage level only allowed for the whole RDD, so a new storage level may be needed for RDD partition to directly determine whether to serialize or just remove from memory.

Besides LCS, FIFO or LFU is easy to be implemented.