Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Duplicate
Description
I have noticed that when an RDD is checkpointed, every operation in its lineage runs twice.
For example, when I run the following code and watch the stages...
    val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
    newRDD.checkpoint
    print(newRDD.count())
I see the distinct and count operations appear twice, and the shuffle disk writes and reads (from the distinct) also occur twice.
My newRDD is persisted (MEMORY_AND_DISK_SER), so why can't the checkpoint simply save those partitions to disk once the first operations have completed?
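The snippet above leaves out its setup, so here is a self-contained sketch that may help when reproducing the behaviour. The sample data standing in for prevRDD, the local master, the temporary checkpoint directory, and the object name CheckpointRepro are all assumptions made for illustration, not part of the original report. As far as the Spark API documentation describes it, checkpoint() only marks the RDD; the checkpoint files are written by a separate job after the first action finishes, which is why the docs recommend persisting the RDD first.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointRepro {
  // Returns (count, whether the RDD is checkpointed after the action).
  def run(sc: SparkContext, checkpointDir: String): (Long, Boolean) = {
    // checkpoint() needs a checkpoint directory to be set beforehand.
    sc.setCheckpointDir(checkpointDir)

    // Hypothetical stand-in for prevRDD from the report.
    val prevRDD = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30)))

    val newRDD = prevRDD
      .map(a => (a._1, 1L))
      .distinct()
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Marks the RDD for checkpointing; nothing is written yet.
    newRDD.checkpoint()

    // The first action computes the RDD; the checkpoint files are
    // written by a follow-up job once this action has completed.
    val n = newRDD.count()
    (n, newRDD.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-repro")
    val sc = new SparkContext(conf)
    try {
      val dir = Files.createTempDirectory("spark-checkpoint").toString
      val (n, done) = run(sc, dir)
      println(s"count=$n checkpointed=$done")
    } finally {
      sc.stop()
    }
  }
}
```

Watching the stages in the Spark UI while this runs should show whether the second job re-executes the shuffle or reads the persisted partitions on a given Spark version.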
Issue Links
- duplicates SPARK-8582 Optimize checkpointing to avoid computing an RDD twice (Resolved)