I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset using the pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF)
There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPredict' column - with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles.
Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a column rawPredictions containing the probability vectors (with probability of 1 assigned to predicted label, and 0 assigned to the rest).
While running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator:
I tried the following code, using the dataset given in Spark's CV documentation for cross validator. I also pass the DF through a StringIndexer transformation for the RF:
Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF.