Affects Version/s: 1.1.0
Fix Version/s: None
We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case, which is slightly different from Core.
Some content to consider for inclusion from Burak Yavuz:
If you set the checkpointing directory, however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there.
You won't need to keep the shuffle data from before the checkpointed state, so it can be safely removed (and will be removed automatically).
However, checkpoint must be called explicitly, as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 ; just setting the directory will not be enough.
Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay compared to hitting a Disk Space Full error three hours in and having to start from scratch.
The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
subtlety here is that .checkpoint() is lazy, just like .cache(): until you call an action, nothing happens. Therefore, if you do 1000 maps in a
row and don't checkpoint until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long.
I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS and resets all of the RDD's dependencies, so you start
a fresh lineage. My understanding is that checkpointing should still be done every N operations to reset the lineage; however, an action must be
performed before the lineage grows too long.
A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html