Description
Consider the following code:
for (i <- 0 until totalIter) {
  val previousCorpus = corpus
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter))
  val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms,
    numTerms, numTopics, alpha, beta).persist(storageLevel)
  val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms,
    numTerms, numTopics, alpha, beta).persist(storageLevel)
  corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel)
  globalTopicCounter = collectGlobalCounter(corpus, numTopics)
  assert(bsum(globalTopicCounter) == sumTerms)
  previousCorpus.unpersistVertices()
  corpusTopicDist.unpersistVertices()
  corpusSampleTopics.unpersistVertices()
}
Without a periodic checkpoint operation, the following problems arise:
1. The lineage of the corpus RDD grows deeper with every iteration.
2. The shuffle files accumulate and grow too large.
3. A single server crash forces the algorithm to recompute the entire lineage from the start (see the sketch below).
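One possible fix is to checkpoint the corpus graph every few iterations, which truncates the lineage and lets old shuffle files be cleaned up. The sketch below is an illustration, not the actual patch: the checkpointInterval parameter and the checkpointIfNeeded helper are hypothetical, and the driver is assumed to have already called sc.setCheckpointDir.

import org.apache.spark.graphx.Graph

// Sketch only: `checkpointInterval` is a hypothetical parameter, and the
// driver is assumed to have called sc.setCheckpointDir(...) beforehand.
def checkpointIfNeeded[VD, ED](corpus: Graph[VD, ED], iter: Int,
    checkpointInterval: Int): Unit = {
  if (checkpointInterval > 0 && (iter + 1) % checkpointInterval == 0) {
    // Graph.checkpoint() marks both the vertex and edge RDDs for
    // checkpointing; the data is written out the next time the graph is
    // materialized, cutting off the accumulated lineage at this iteration.
    corpus.checkpoint()
  }
}

Called inside the loop right after corpus = updateCounter(...), this would bound the lineage depth and shuffle-file growth to at most checkpointInterval iterations, so a crash only requires recomputation from the last checkpoint rather than from iteration 0.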
Issue Links
- is related to SPARK-4672 Cut off the super long serialization chain in GraphX to avoid the StackOverflow error (Resolved)
- relates to SPARK-3625 In some cases, the RDD.checkpoint does not work (Resolved)