[SPARK-21742] BisectingKMeans generate different models with/without caching - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: ML
Labels:
None

Description

I found that BisectingKMeans will generate different models if the input is cached or not.
Using the same dataset in BisectingKMeansSuite, we can found that if we cache the input, then the number of centers will change from 2 to 3.

So it looks like a potential bug.

import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.linalg._
import scala.util.Random
case class TestRow(features: org.apache.spark.ml.linalg.Vector)

val rows = 10
val dim = 1000
val seed = 42
val nnz = 130

val bkm = new BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)

val random = new Random(seed)
val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))

val sparseDataset = spark.createDataFrame(rdd)

scala> bkm.fit(sparseDataset).clusterCenters
17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res22: Array[org.apache.spark.ml.linalg.Vector] = Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.24542718129722224,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0....

scala> bkm.fit(sparseDataset).clusterCenters.length
17/08/16 17:12:36 WARN BisectingKMeans: The input RDD 667 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res23: Int = 2


scala> sparseDataset.persist()
res24: sparseDataset.type = [features: vector]

scala> bkm.fit(sparseDataset).clusterCenters
17/08/16 17:14:35 WARN BisectingKMeans: The input RDD 806 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res26: Array[org.apache.spark.ml.linalg.Vector] = Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1...
scala> bkm.fit(sparseDataset).clusterCenters.length
17/08/16 17:14:38 WARN BisectingKMeans: The input RDD 855 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res27: Int = 3

And suggested by srowen, I retest it with the same dataset generated in a deterministic way, now the results are the same.

val random = new Random(seed)
val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
val vecs = rdd.collect()
val rdd2 = sc.parallelize(vecs)

val sparseDataset2 = spark.createDataFrame(rdd2)

scala> bkm.fit(sparseDataset2).clusterCenters.length
17/08/16 17:20:36 WARN BisectingKMeans: The input RDD 1114 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res35: Int = 3

scala> bkm.fit(sparseDataset2).clusterCenters
17/08/16 17:20:43 WARN BisectingKMeans: The input RDD 1164 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res36: Array[org.apache.spark.ml.linalg.Vector] = Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1...

scala> sparseDataset2.persist()
res37: sparseDataset2.type = [features: vector]

scala> bkm.fit(sparseDataset2).clusterCenters.length
17/08/16 17:20:54 WARN BisectingKMeans: The input RDD 1216 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res38: Int = 3

scala> bkm.fit(sparseDataset2).clusterCenters
17/08/16 17:20:58 WARN BisectingKMeans: The input RDD 1265 is not directly cached, which may hurt performance if its parent RDDs are also not cached.
res39: Array[org.apache.spark.ml.linalg.Vector] = Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1...

BisectingKMeans generate different models with/without caching

Details

Description

Attachments

Activity

People

Dates