[SPARK-32676] Fix double caching in KMeans/BiKMeans


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: ML
    • Labels: None

      Description

      On the .mllib side, the storageLevel of the input data is always ignored, so the data ends up being cached twice:

      @Since("0.8.0")
      def run(data: RDD[Vector]): KMeansModel = {
        val instances = data.map(point => (point, 1.0))
        runWithWeight(instances, None)
      }
       
      private[spark] def runWithWeight(
          data: RDD[(Vector, Double)],
          instr: Option[Instrumentation]): KMeansModel = {
      
        // Compute squared norms and cache them.
        val norms = data.map { case (v, _) =>
          Vectors.norm(v, 2.0)
        }
      
        val zippedData = data.zip(norms).map { case ((v, w), norm) =>
          new VectorWithNorm(v, norm, w)
        }
      
        // When called from run() above, `data` is the freshly mapped `instances` RDD,
        // so its storage level is always NONE and this branch is always taken,
        // even if the caller already cached the original input.
        if (data.getStorageLevel == StorageLevel.NONE) {
          zippedData.persist(StorageLevel.MEMORY_AND_DISK)
        }
        val model = runAlgorithmWithWeight(zippedData, instr)
        zippedData.unpersist()
      
        model
      } 
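
      One way to remove the double caching is to decide whether to persist based on the
      storage level of the original input before it is mapped, and to thread that decision
      through to runWithWeight. Below is a minimal sketch of such a change, using a
      hypothetical handlePersistence parameter; the actual patch may differ:

      @Since("0.8.0")
      def run(data: RDD[Vector]): KMeansModel = {
        // Check the caller's RDD, not the mapped one, so an already-cached input is honored.
        val handlePersistence = data.getStorageLevel == StorageLevel.NONE
        val instances = data.map(point => (point, 1.0))
        runWithWeight(instances, handlePersistence, None)
      }

      private[spark] def runWithWeight(
          data: RDD[(Vector, Double)],
          handlePersistence: Boolean,
          instr: Option[Instrumentation]): KMeansModel = {

        // Compute squared norms and zip them with the weighted points.
        val norms = data.map { case (v, _) => Vectors.norm(v, 2.0) }
        val zippedData = data.zip(norms).map { case ((v, w), norm) =>
          new VectorWithNorm(v, norm, w)
        }

        // Persist the intermediate RDD only when the caller has not cached the input.
        if (handlePersistence) {
          zippedData.persist(StorageLevel.MEMORY_AND_DISK)
        }
        val model = runAlgorithmWithWeight(zippedData, instr)
        if (handlePersistence) {
          zippedData.unpersist()
        }

        model
      }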

    People

    • Assignee: podongfeng (zhengruifeng)
    • Reporter: podongfeng (zhengruifeng)
    • Votes: 0
    • Watchers: 2
