Affects Version/s: 1.1.0
Fix Version/s: None
We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case, which is slightly different from Core.
Some content to consider for inclusion from Burak Yavuz:
If you set the checkpointing directory, however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there.
You won't need to keep the shuffle data from before the checkpointed state, so it can be safely removed (and will be removed automatically).
However, checkpoint must be called explicitly, as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 ; just setting the directory will not be enough.
Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay compared to hitting a Disk Space Full error three hours in and having to start from scratch.
The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
subtlety here is that .checkpoint() is lazy, just like .cache(): until you call an action, nothing happens. Therefore, if you do 1000 maps in a
row and don't checkpoint until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long.
I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS and resets all of the RDD's dependencies, so you start
a fresh lineage. My understanding is that checkpointing should still be done every N operations to reset the lineage; however, an action must be
performed before the lineage grows too long.
A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html