Description
When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD baggedInput in ml.tree.impl.RandomForest.run() is persisted but used only once, so in this run the persist operation is unnecessary.
val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
    (tp: TreePoint) => tp.weight, seed = seed)
  .persist(StorageLevel.MEMORY_AND_DISK)
...
while (nodeStack.nonEmpty) {
  ...
  timer.start("findBestSplits")
  RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
    treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
  timer.stop("findBestSplits")
}
baggedInput.unpersist()
However, the action on baggedInput sits inside a while loop.
In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses baggedInput.
In most ML applications, the loop executes many times, so baggedInput is used by many actions and the persist is necessary.
That is why the persist operation is only "conditionally" unnecessary: whether it pays off depends on how many times the loop runs.
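To make the "conditional" part concrete, here is a minimal sketch, not taken from the Spark code base. It counts how often a map function is evaluated when an RDD is consumed by one action versus several, with and without persist(). The object name ConditionalPersistSketch, the helper evaluations, and the 100-element input are made up for illustration; the sketch assumes Spark is on the classpath and runs in local mode, where no task retries inflate the accumulator.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object, not part of Spark.
object ConditionalPersistSketch {

  /** Runs `actions` count() jobs over a mapped RDD and returns how many
    * times the map function was evaluated across all of those jobs. */
  def evaluations(spark: SparkSession, actions: Int, persist: Boolean): Long = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("mapCalls")

    // Stand-in for an expensive derived RDD such as baggedInput.
    val mapped: RDD[Int] = sc.parallelize(1 to 100, 4).map { x =>
      calls.add(1L) // one increment per evaluated element
      x * 2
    }

    if (persist) mapped.persist(StorageLevel.MEMORY_AND_DISK)

    // Stand-in for the while loop: each iteration triggers one action.
    (1 to actions).foreach(_ => mapped.count())

    if (persist) mapped.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ConditionalPersistSketch")
      .getOrCreate()
    try {
      for (actions <- Seq(1, 3); persist <- Seq(false, true)) {
        val n = evaluations(spark, actions, persist)
        println(s"actions=$actions persist=$persist evaluations=$n")
      }
    } finally {
      spark.stop()
    }
  }
}
```

With a single action, both variants evaluate the map function once per element, so persisting only adds storage overhead. With several actions, the unpersisted RDD is recomputed for every action, while the persisted one is computed once and then read back from the cache.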
The same situation exists in many other ML algorithms, e.g., the RDD instances in ml.clustering.KMeans.fit() and the RDD indices in mllib.clustering.BisectingKMeans.run().
This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
Issue Links
- duplicates SPARK-29872 Improper cache strategy in examples (Resolved)