  1. Spark
  2. SPARK-32569

Gaussian cannot handle data close to MaxDouble


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None
    • Environment: Running Spark in local mode within a Java application on Windows 10

    Description

      Fitting GaussianMixture from Apache Spark MLlib on [this dataset|https://user.informatik.uni-goettingen.de/~sherbol/MaxDouble.arff], which contains values close to MaxDouble (values > 10^306), fails with the error below. KMeans and Bisecting KMeans both handle the same dataset, which raises the question of whether this is a bug or expected behavior.

      Stacktrace:

      org.apache.spark.SparkException: Failed to execute user defined function(GaussianMixtureModel$$Lambda$2841/0x00000001003ab040: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
      at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1070)
      at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
      at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)

      at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.applyOrElse(Optimizer.scala:1502)
      at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)

      at org.apache.spark.ml.clustering.ClusteringSummary.clusterSizes$lzycompute(ClusteringSummary.scala:49)
      at org.apache.spark.ml.clustering.GaussianMixture.fit(GaussianMixture.scala:374)

      Caused by: breeze.linalg.NotConvergedException:
      at breeze.linalg.eigSym$.breeze$linalg$eigSym$$doEigSym(eig.scala:164)
      at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:111)
      at breeze.linalg.eigSym$EigSym_DM_Impl$.apply(eig.scala:109)
      at breeze.generic.UFunc.apply(UFunc.scala:46)
      at breeze.generic.UFunc.apply$(UFunc.scala:45)
      at breeze.linalg.eigSym$.apply(eig.scala:106)
      at org.apache.spark.ml.stat.distribution.MultivariateGaussian.calculateCovarianceConstants(MultivariateGaussian.scala:117)
      at org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1$lzycompute(MultivariateGaussian.scala:58)
      at org.apache.spark.ml.stat.distribution.MultivariateGaussian.x$1(MultivariateGaussian.scala:58)

      at org.apache.spark.ml.clustering.GaussianMixtureModel$.computeProbabilities(GaussianMixture.scala:287)
      at org.apache.spark.ml.clustering.GaussianMixtureModel.predictProbability(GaussianMixture.scala:171)
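
The trace above ends in an eigendecomposition of a covariance matrix (eigSym called from calculateCovarianceConstants). One plausible explanation, which the report itself does not confirm, is that second-moment statistics overflow for values of this magnitude: squaring a deviation near 10^306 exceeds the largest finite double, so the covariance entries stop being finite. The following is a minimal plain-Java sketch of that arithmetic only. It does not call Spark or Breeze; the class name, helper methods, and sample values are hypothetical and are not taken from the linked dataset.

```java
final class CovarianceOverflowSketch {

    // Sample variance of one feature column: sum((x - mean)^2) / (n - 1).
    static double variance(double[] xs) {
        double mean = 0.0;
        for (double x : xs) {
            // Divide before adding so the running mean itself cannot overflow.
            mean += x / xs.length;
        }
        double sumSq = 0.0;
        for (double x : xs) {
            double d = x - mean;
            // For |d| around 1e306, d * d is around 1e612, far above
            // Double.MAX_VALUE (about 1.8e308), so this becomes Infinity.
            sumSq += d * d;
        }
        return sumSq / (xs.length - 1);
    }

    // Hypothetical preprocessing step: divide the column by its largest
    // absolute value so every entry lies in [-1, 1].
    static double[] rescale(double[] xs) {
        double maxAbs = 0.0;
        for (double x : xs) {
            maxAbs = Math.max(maxAbs, Math.abs(x));
        }
        if (maxAbs == 0.0) {
            return xs.clone();
        }
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            out[i] = xs[i] / maxAbs;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up values of the same order of magnitude as in the report.
        double[] column = {1.0e306, -1.0e306, 5.0e305, 0.0};
        System.out.println("raw variance:      " + variance(column));
        System.out.println("rescaled variance: " + variance(rescale(column)));
    }
}
```

With these sample values the raw variance overflows to Infinity, while the rescaled column yields a finite variance. If this is indeed what happens inside the model, rescaling the features before fitting may avoid the exception, at the cost of changing the units of the fitted means and covariances.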

          People

            Assignee: Unassigned
            Reporter: Tobias Haar (thaar)