  1. Spark
  2. SPARK-29827

Wrong persist strategy in mllib.clustering.BisectingKMeans.run


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:
      None

      Description

      There are three persist misuses in mllib.clustering.BisectingKMeans.run.

      • First, the RDD input should be persisted: it is used not only by the action first(), but also by other actions in the code that follows.
      • Second, the RDD assignments should be persisted. It is used more than once in the function summarize(), which contains an action on assignments.
      • Third, once the RDD assignments is persisted, persisting the RDD norms becomes unnecessary: norms is an intermediate RDD, and since its child RDD assignments is persisted, norms no longer needs its own persist.
        private[spark] def run(
            input: RDD[Vector],
            instr: Option[Instrumentation]): BisectingKMeansModel = {
          if (input.getStorageLevel == StorageLevel.NONE) {
            logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
              + " its parent RDDs are also not cached.")
          }
          // Needs to persist input
          val d = input.map(_.size).first()
          logInfo(s"Feature dimension: $d.")
          val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
          // Compute and cache vector norms for fast distance computation.
          val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
          val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
          var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
          var activeClusters = summarize(d, assignments, dMeasure)
          // ...
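
      Following the three points above, a minimal sketch of the corrected persist strategy could look like the fragment below. This is not the final patch: the storage level, the handlePersistence guard, and the unpersist placement at the end of run() are assumptions for illustration.

        private[spark] def run(
            input: RDD[Vector],
            instr: Option[Instrumentation]): BisectingKMeansModel = {
          // Persist input: it is consumed by first() and by later actions.
          val handlePersistence = input.getStorageLevel == StorageLevel.NONE
          if (handlePersistence) {
            input.persist(StorageLevel.MEMORY_AND_DISK)
          }
          val d = input.map(_.size).first()
          logInfo(s"Feature dimension: $d.")
          val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
          // No persist on norms: its child RDD assignments is cached below.
          val norms = input.map(v => Vectors.norm(v, 2.0))
          val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
          // Persist assignments: summarize() runs more than one action over it.
          var assignments = vectors.map(v => (ROOT_INDEX, v)).persist(StorageLevel.MEMORY_AND_DISK)
          var activeClusters = summarize(d, assignments, dMeasure)
          // ... rest of run() unchanged; unpersist input (if persisted here) and
          // assignments once they are no longer needed.
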

      This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.


              People

              • Assignee: Unassigned
              • Reporter: spark_cachecheck IcySanwitch
              • Votes: 0
              • Watchers: 2
