Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.0.0
None
None
Environment: Running Spark in local mode within a Java application on Windows 10
Description
Running GaussianMixture from Apache Spark MLlib on [this dataset|https://user.informatik.uni-goettingen.de/~sherbol/MaxDouble.arff], which contains values close to the maximum double value (values > 10^306), results in the error below. KMeans and Bisecting KMeans both handle the same dataset, which raises the question of whether this is a bug or expected behavior.
Stacktrace:
org.apache.spark.SparkException: Failed to execute user defined function(GaussianMixtureModel$$Lambda$2841/0x00000001003ab040: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1070)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.applyOrElse(Optimizer.scala:1502)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
at org.apache.spark.ml.clustering.ClusteringSummary.clusterSizes$lzycompute(ClusteringSummary.scala:49)
at org.apache.spark.ml.clustering.GaussianMixture.fit(GaussianMixture.scala:374)
Caused by: breeze.linalg.NotConvergedException:
at breeze.linalg.eigSym$.breeze$linalg$eigSym$$doEigSym(eig.scala:164)
at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:111)
at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:109)
at breeze.generic.UFunc.apply(UFunc.scala:46)
at breeze.generic.UFunc.apply$(UFunc.scala:45)
at breeze.linalg.eigSym$.apply(eig.scala:106)
at org.apache.spark.ml.stat.distribution.MultivariateGaussian.calculateCovarianceConstants(MultivariateGaussian.scala:117)
at org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1$lzycompute(MultivariateGaussian.scala:58)
at org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1(MultivariateGaussian.scala:58)
at org.apache.spark.ml.clustering.GaussianMixtureModel$.computeProbabilities(GaussianMixture.scala:287)
at org.apache.spark.ml.clustering.GaussianMixtureModel.predictProbability(GaussianMixture.scala:171)
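
One plausible reading of the trace above, offered as an assumption rather than a confirmed diagnosis: a covariance estimate involves squared deviations, and squaring a double larger than roughly 1.34e154 overflows to Infinity, so the matrix passed to the eigendecomposition may contain non-finite entries. The plain-Java sketch below does not call Spark or Breeze; the class name, the `variance` helper, the sample values, and the 1e306 scale factor are all hypothetical and only illustrate the arithmetic.

```java
class OverflowSketch {
    // Population variance of one feature column (hypothetical helper, not Spark code).
    static double variance(double[] xs) {
        double mean = 0.0;
        for (double x : xs) {
            mean += x / xs.length; // divide before adding so the mean itself stays finite
        }
        double acc = 0.0;
        for (double x : xs) {
            double d = x - mean;
            acc += d * d / xs.length; // d * d overflows once |d| exceeds about 1.34e154
        }
        return acc;
    }

    // Divide every value by a constant factor before any squaring takes place.
    static double[] scale(double[] xs, double factor) {
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            out[i] = xs[i] / factor;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up values in the same range as the reported dataset (> 10^306).
        double[] column = {1.0e306, -1.0e306, 5.0e305};

        double raw = variance(column);
        System.out.println("raw variance is infinite: " + Double.isInfinite(raw));

        // Once Infinity appears, later arithmetic can turn it into NaN.
        System.out.println("Infinity - Infinity is NaN: " + Double.isNaN(raw - raw));

        double scaled = variance(scale(column, 1.0e306));
        System.out.println("scaled variance is finite: " + Double.isFinite(scaled));
    }
}
```

If this reading is right, rescaling the features before fitting would keep the covariance entries finite. Whether that is acceptable depends on the application, since the fitted means and covariances are then expressed in the rescaled units.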