Spark / SPARK-16857

CrossValidator and KMeans throws IllegalArgumentException

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: spark-jobserver docker image. Spark 1.6.1 on Ubuntu, Hadoop 2.4

    Description

      I am attempting to use CrossValidator to train a KMeans model. When I attempt to fit the data, Spark throws the IllegalArgumentException below, because KMeans writes an Integer into the prediction column instead of the Double that the evaluator requires. Before I go too far: is using CrossValidator with KMeans supported?

      Here's the exception:

      java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually IntegerType.
      at scala.Predef$.require(Predef.scala:233)
      at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
      at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
      at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
      at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
      at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
      at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
      at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
      at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
      at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
      at spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
      at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
      at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      Here is the code I'm using to set up my cross validator. As the stack trace above indicates, it fails at the fit step:

      ...
      val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
      val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
      val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, mpc, labelConverter))

      val evaluator = new MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")

      val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 200, 500)).build()
      val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
      val cvModel = cv.fit(trainingData)
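
      The type mismatch named in the exception can be sketched outside of CrossValidator as below. This is a minimal illustration under assumptions, not a recommended fix: `featurized` is a hypothetical DataFrame that already carries the "indexedFeatures" and "approvedIndex" columns from the report, and "predictionDouble" is a column name made up for the sketch. Note also that KMeans cluster indices are arbitrary, so scoring them against class labels with a classification metric may not be meaningful even once the types line up.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Sketch only: `featurized` is a hypothetical input that already contains
// the "indexedFeatures" (vector) and "approvedIndex" (double) columns.
def evaluateWithCast(featurized: DataFrame): Double = {
  val kmeans = new KMeans().setK(2).setFeaturesCol("indexedFeatures")

  // KMeans writes the cluster index into "prediction" as IntegerType.
  val clustered = kmeans.fit(featurized).transform(featurized)

  // Copy the cluster index into a DoubleType column (name is an assumption).
  val withDouble =
    clustered.withColumn("predictionDouble", col("prediction").cast(DoubleType))

  // Point the evaluator at the DoubleType column so its schema check passes.
  new MulticlassClassificationEvaluator()
    .setLabelCol("approvedIndex")
    .setPredictionCol("predictionDouble")
    .evaluate(withDouble)
}
```

      Calling `clustered.printSchema()` before the cast is a quick way to confirm which type the prediction column actually has.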

    People

    • Assignee: Unassigned
    • Reporter: Ryan Claussen (rtclauss)
    • Votes: 0
    • Watchers: 4
