SPARK-9011: Spark 1.4.0 | Spark.ML Classifier Output Formats Inconsistent --> Grid search working on LR but not on RF

      I ran into this bug while running pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier over a small dataset. I consider it a bug because CrossValidator works with LR (Logistic Regression) but not with RF.

      The issue is in how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the rawPrediction column: it expects that column to contain vectors, which is what LR produces, whereas with RF the prediction column contains doubles.

      Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a rawPrediction column containing probability vectors (with a probability of 1 assigned to the predicted label and 0 assigned to the rest).
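
      The second option can be sketched in plain Python. This is only an illustration of the proposed vector layout, not Spark code: the function name prediction_to_vector and the num_classes parameter are hypothetical, and a plain list stands in for a Spark vector.

```python
def prediction_to_vector(prediction, num_classes):
    """Return a list with 1.0 at the predicted class index and 0.0 elsewhere.

    Hypothetical helper: illustrates the vector layout proposed in the
    report, using a plain list in place of a Spark vector.
    """
    index = int(prediction)
    if not 0 <= index < num_classes:
        raise ValueError(
            "prediction %r is outside [0, %d)" % (prediction, num_classes))
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Binary case from the report: labels are 0.0 and 1.0.
print(prediction_to_vector(1.0, 2))
print(prediction_to_vector(0.0, 2))
```

      With such a column the evaluator would see a vector for every row, although the scores would be hard 0/1 values rather than graded probabilities.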

      Detailed Observation:
      I was running a grid search on an RF classifier over a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I pass a DataFrame of features and labels to CrossValidator:

      Py4JJavaError: An error occurred while calling o1464.evaluate.
      : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.
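
      The message reads like a column type check that fails before any metric is computed. A minimal plain-Python sketch of that kind of check follows. It is not Spark's source: require_column_type and the dict-based schemas are hypothetical, and the type names are plain strings.

```python
def require_column_type(schema, column, expected_type):
    """Raise if the named column does not have the expected type.

    Hypothetical stand-in for a schema check; `schema` is a plain dict
    mapping column names to type names, not a Spark schema object.
    """
    actual = schema[column]
    if actual != expected_type:
        raise ValueError(
            "requirement failed: Column %s must be of type %s "
            "but was actually %s." % (column, expected_type, actual))


# Schemas as described in the report: LR yields a vector column,
# while the RF setup puts doubles in the column named rawPrediction.
lr_schema = {"features": "VectorUDT", "label": "DoubleType",
             "rawPrediction": "VectorUDT"}
rf_schema = {"features": "VectorUDT", "label": "DoubleType",
             "rawPrediction": "DoubleType"}

require_column_type(lr_schema, "rawPrediction", "VectorUDT")  # passes
try:
    require_column_type(rf_schema, "rawPrediction", "VectorUDT")
except ValueError as err:
    print(err)
```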

      I tried the following code, using the dataset given in Spark's documentation for CrossValidator. I also pass the DataFrame through a StringIndexer transformation for the RF:

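      from pyspark.mllib.linalg import Vectors
      from pyspark.ml.feature import StringIndexer
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

      # sqlContext is assumed to be the SQLContext provided by the pyspark shell
      dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"])
      # Index the label column for the RF, then rename the indexed column back to "label"
      stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
      si_model = stringIndexer.fit(dataset)
      dataset2 = si_model.transform(dataset)
      keep = [dataset2.features, dataset2.indexed]
      dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
      # The RF prediction column is named "rawPrediction" to match the evaluator's default; it holds doubles
      rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
      grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
      evaluator = BinaryClassificationEvaluator()
      cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
      cvModel = cv.fit(dataset3)  # fails with the Py4JJavaError shown above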

      Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF.


      Assignee: Shivam Verma (shivamverma)

