  1. Spark
  2. SPARK-29856

Conditional unnecessary persist on RDDs in ML algorithms

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

      Description

      When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD baggedInput in ml.tree.impl.RandomForest.run() is persisted, but it is used only once. So this persist operation is unnecessary.

          val baggedInput = BaggedPoint
            .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
              (tp: TreePoint) => tp.weight, seed = seed)
            .persist(StorageLevel.MEMORY_AND_DISK)
            ...
          while (nodeStack.nonEmpty) {
            ...
            timer.start("findBestSplits")
            RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
              treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
            timer.stop("findBestSplits")
          }
          baggedInput.unpersist()
      

      However, the action on baggedInput runs inside a while loop.
      In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses baggedInput.
      In most ML applications, the loop executes many times, which means baggedInput is used by many actions. In that case the persist is necessary.
      That is why the persist operation is only "conditionally" unnecessary.
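
      The condition above can be illustrated with a minimal sketch. It assumes a local Spark session and that the number of actions is known in advance, which is a simplification: in RandomForest.run() the number of loop iterations depends on nodeStack and is not known before the loop starts. The helper persistIfReused, the object name, and the toy RDD are hypothetical and are not part of Spark or of this issue; only persist(), unpersist(), getStorageLevel, and StorageLevel are existing Spark APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ConditionalPersistSketch {

  // Hypothetical helper, not a Spark API: persist only when the caller
  // expects more than one action to run on the RDD.
  def persistIfReused[T](rdd: RDD[T], expectedActions: Int): RDD[T] =
    if (expectedActions > 1) rdd.persist(StorageLevel.MEMORY_AND_DISK) else rdd

  // Runs `expectedActions` actions on a small RDD and reports whether the
  // RDD was marked for persistence before the actions ran.
  def demo(spark: SparkSession, expectedActions: Int): Boolean = {
    val input = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)
    val data = persistIfReused(input, expectedActions)
    val wasPersisted = data.getStorageLevel != StorageLevel.NONE

    // Each count() is a separate action, standing in for one loop iteration.
    (1 to expectedActions).foreach(_ => data.count())

    // Release the cached blocks; this has no effect if nothing was persisted.
    data.unpersist()
    wasPersisted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("conditional-persist-sketch")
      .getOrCreate()
    try {
      println(s"one action, persisted: ${demo(spark, 1)}")
      println(s"five actions, persisted: ${demo(spark, 5)}")
    } finally {
      spark.stop()
    }
  }
}
```

      With a single action the RDD is computed once whether or not it is persisted, so persisting only adds storage overhead. With several actions, skipping the persist would recompute the lineage for every action.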

      The same situation exists in many other ML algorithms, e.g., the RDD instances in ml.clustering.KMeans.fit() and the RDD indices in mllib.clustering.BisectingKMeans.run().

      This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

              People

              • Assignee: Unassigned
              • Reporter: spark_cachecheck IcySanwitch
              • Votes: 0
              • Watchers: 2
