[SPARK-9011] Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent --> Grid search working on LR but not on RF - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: ML, MLlib, PySpark
Labels:
- cross-validation
- ml
- mllib
- pyspark
- randomforest
- tuning
Environment:

Spark 1.4.0 standalone on top of Hadoop 2.3 on single node running CentOS

Description

Hi,

I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset using the pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF)

Bug:
There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPredict' column - with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles.

Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a column rawPredictions containing the probability vectors (with probability of 1 assigned to predicted label, and 0 assigned to the rest).

Detailed Observation:
While running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator:

Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.

I tried the following code, using the dataset given in Spark's CV documentation for cross validator. I also pass the DF through a StringIndexer transformation for the RF:

dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", "label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction",featuresCol="features",numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)

Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF.

Attachments

Issue Links

is related to

SPARK-9016 Make random forest classifier extend Classifier abstraction

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Shivam Verma

Votes:: 1 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 13/Jul/15 09:08

Updated:: 13/Jul/15 23:46

Resolved:: 13/Jul/15 23:46