[SPARK-29856] Conditional unnecessary persist on RDDs in ML algorithms - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: ML, MLlib
Labels: None

Description

When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD baggedInput in ml.tree.impl.RandomForest.run() is persisted but used only once, so this persist operation is unnecessary.

    val baggedInput = BaggedPoint
      .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
        (tp: TreePoint) => tp.weight, seed = seed)
      .persist(StorageLevel.MEMORY_AND_DISK)
      ...
    while (nodeStack.nonEmpty) {
      ...
      timer.start("findBestSplits")
      RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
        treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
      timer.stop("findBestSplits")
    }
    baggedInput.unpersist()

However, the action on baggedInput sits inside a while loop.
In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses baggedInput.
In most ML applications, the loop executes many times, so baggedInput is used by many actions and the persist is necessary.
That is why the persist operation is only "conditionally" unnecessary.
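
To illustrate what a "conditional" persist could look like, here is a minimal sketch. It is not a proposed patch and not how RandomForest.run() is written: the helper withConditionalPersist and its expectedActions parameter are hypothetical, and the sketch assumes the caller can estimate up front how many actions will run on the RDD, which the while loop above does not make obvious. Only RDD.persist, RDD.unpersist, RDD.getStorageLevel and StorageLevel come from Spark itself.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object ConditionalPersist {
  // Hypothetical helper (not part of Spark): persist `rdd` only when more
  // than one action is expected to run on it and the caller has not already
  // assigned a storage level. `expectedActions` is an assumed estimate
  // supplied by the caller.
  def withConditionalPersist[T, R](rdd: RDD[T], expectedActions: Int)(body: RDD[T] => R): R = {
    val shouldPersist =
      expectedActions > 1 && rdd.getStorageLevel == StorageLevel.NONE
    if (shouldPersist) rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try body(rdd)
    finally {
      // Release the cached blocks only if this helper created them.
      if (shouldPersist) rdd.unpersist()
    }
  }
}
```

With expectedActions = 1 (the GradientBoostedTreeRegressorExample case) the body runs on the unpersisted RDD. With a larger value the RDD is cached for the duration of the body and released afterwards. For example, ConditionalPersist.withConditionalPersist(input, expectedActions = 3) { r => (1 to 3).map(_ => r.count()).sum } caches input across the three count() actions.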

The same situation exists in many other ML algorithms, e.g., the RDD instances in ml.clustering.KMeans.fit() and the RDD indices in mllib.clustering.BisectingKMeans.run().

This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

Issue Links

duplicates

SPARK-29872 Improper cache strategy in examples

Resolved

People

Assignee: Unassigned

Reporter: IcySanwitch

Votes: 0

Watchers: 2

Dates

Created: 12/Nov/19 07:13

Updated: 16/Nov/19 20:47

Resolved: 16/Nov/19 20:47