With cached RDDs, Spark can be used for online analytics where it is used to respond to online queries. But loss of RDD partitions due to node/executor failures can cause huge delays in such use cases as the data would have to be regenerated.
Cached RDDs, even when using multiple replicas per block, are not currently resilient to node failures when multiple executors are started on the same node. Block replication currently chooses a peer at random, and this peer could also exist on the same host.
This effort would add topology aware replication to Spark that can be enabled with pluggable strategies. For ease of development/review, this is being broken down to three major work-efforts:
1. Making peer selection for replication pluggable
2. Providing pluggable implementations for providing topology and topology aware replication
3. Pro-active replenishment of lost blocks