[SPARK-8666] checkpointing does not take advantage of persisted/cached RDDs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I have been noticing that when checkpointing RDDs, all operations are occurring TWICE.

For example, when I run the following code and watch the stages...

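import org.apache.spark.storage.StorageLevel

// prevRDD is an existing pair RDD; SparkContext.setCheckpointDir(...) must already have been called.
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct().persist(StorageLevel.MEMORY_AND_DISK_SER)
newRDD.checkpoint()
println(newRDD.count())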

I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE.

My newRDD is already persisted, so why can't the checkpoint simply save those partitions to disk once the first operations have completed?
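For anyone trying to reproduce this, here is a minimal self-contained sketch of the same pattern. The local master, the temporary checkpoint directory, the sample data, and the `CheckpointSketch` object name are all assumptions for illustration and are not from the original report. The scaladoc for `RDD.checkpoint` says the call must come before any job has run on the RDD and strongly recommends persisting the RDD so that writing the checkpoint does not require recomputation.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical reproduction harness; names and data are illustrative only.
object CheckpointSketch {
  def run(): (Long, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      // checkpoint() requires a checkpoint directory to be set first.
      sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

      // Made-up stand-in for prevRDD from the description.
      val prevRDD = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)), 2)

      val newRDD = prevRDD
        .map(a => (a._1, 1L))
        .distinct()
        .persist(StorageLevel.MEMORY_AND_DISK_SER)

      // Mark for checkpointing before the first action runs on newRDD.
      newRDD.checkpoint()

      // The first action computes newRDD; the checkpoint data is written
      // once that job has finished.
      val n = newRDD.count()
      (n, newRDD.isCheckpointed)
    } finally {
      sc.stop()
    }
  }
}
```

Calling `CheckpointSketch.run()` and then inspecting the jobs and stages in the Spark UI (or printing `newRDD.toDebugString` before `sc.stop()`) is one way to check whether the checkpoint write reads from the persisted partitions or recomputes the lineage.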

Issue Links

duplicates: SPARK-8582 Optimize checkpointing to avoid computing an RDD twice (Resolved)

People

Assignee: Unassigned

Reporter: Glenn Strycker

Votes: 0

Watchers: 2

Dates

Created: 26/Jun/15 16:42

Updated: 29/Jun/15 13:39

Resolved: 27/Jun/15 05:29