[SPARK-27293] Setting random seed produces different results in RandomForestRegressor - ASF JIRA

Details

Type: Question
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.4.0
Fix Version/s: None
Component/s: ML, PySpark
Labels: None

Description

I would like to find out whether there is a bug in the RandomForestRegressor implementation. When I set a seed, I get different results from other people in my class who apply the same seed to the same data.

I am calculating the RMSE metric like this:

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Split the data 70/30 with a fixed seed
(trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3, seed=313)
# Assumed step (missing from the original snippet): fit on the training split, predict on the test split
model = rfr.fit(trainingData)
predictions = model.transform(testData)
evaluator = RegressionEvaluator(labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g" % rmse)

I am setting the seed explicitly. For seed = 50, and for other seeds, I get exactly the same RMSE as my classmates. When I set the seed to 313, I get a different value. What could be the issue here?
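One possible explanation, which this issue does not confirm, is that a seeded split is only reproducible when the data is laid out the same way on every machine. The sketch below is plain Python, not Spark code: `seeded_split` is a hypothetical helper, and the per-partition seeding rule (seed plus partition index) is an assumption chosen only to illustrate the idea.

```python
import random


def seeded_split(partitions, train_fraction, seed):
    """Split rows into (train, test), drawing from one RNG per partition.

    Assumption: each partition gets its own generator seeded with
    seed + partition index, loosely imitating a distributed sampler.
    """
    train, test = [], []
    for index, rows in enumerate(partitions):
        rng = random.Random(seed + index)
        for row in rows:
            # Each row goes to train with probability train_fraction
            (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
one_partition = [rows]
two_partitions = [rows[:500], rows[500:]]

# Same seed, same layout: the split repeats exactly
train_a, test_a = seeded_split(one_partition, 0.7, 313)
train_b, test_b = seeded_split(one_partition, 0.7, 313)

# Same seed, different layout: the split can differ
train_c, test_c = seeded_split(two_partitions, 0.7, 313)

print("same layout, same split:", train_a == train_b)
print("different layout, same split:", train_a == train_c)
```

If something like this is the cause, comparing `trainingData.count()` and `data.rdd.getNumPartitions()` between my machine and a classmate's should show whether the splits already differ before the model is trained.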

People

Assignee: Unassigned

Reporter: Martin Skauen

Votes: 0

Watchers: 2

Dates

Created: 27/Mar/19 11:10

Updated: 29/Mar/19 00:34

Resolved: 29/Mar/19 00:34