Spark / SPARK-27293

Setting random seed produces different results in RandomForestRegressor

Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      I would like to find out whether there is a bug in the random forest implementation. When I set a seed, I get different results from other people in my class who apply the same seed to the same data.

      I am calculating the RMSE metric like this:

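      from pyspark.ml.evaluation import RegressionEvaluator
      from pyspark.ml.regression import RandomForestRegressor

      # Split the data 70/30 with a fixed seed.
      (trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
      rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3, seed=313)

      # The original snippet omitted these two steps; fitting on the training
      # split and predicting on the test split is assumed here.
      model = rfr.fit(trainingData)
      predictions = model.transform(testData)

      evaluator = RegressionEvaluator(labelCol="labels", predictionCol="prediction", metricName="rmse")
      rmse = evaluator.evaluate(predictions)
      print("RMSE = %g " % rmse)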
      

      I set the seed explicitly. With seed = 50, and with other seeds, I get exactly the same RMSE as the other people in my class. With seed = 313, I get a different value. What could be the issue here?
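
One thing that may be worth ruling out is the split itself. The sketch below is a toy, pure-Python stand-in for a seeded random split, not Spark's implementation: the helpers `split_mask` and `seeded_split` are hypothetical names introduced only for illustration. It assumes the seed fixes a sequence of keep/drop decisions by row position, so the same seed applied to the same rows in a different order can select different rows.

```python
import random

def split_mask(num_rows, train_fraction, seed):
    """Return one True/False training decision per row position (toy model)."""
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(num_rows)]

def seeded_split(rows, train_fraction, seed):
    """Split rows into (train, test) using the position-based mask above."""
    mask = split_mask(len(rows), train_fraction, seed)
    train = [r for r, keep in zip(rows, mask) if keep]
    test = [r for r, keep in zip(rows, mask) if not keep]
    return train, test

rows = ["a", "b", "c", "d", "e", "f", "g", "h"]
reordered = rows[::-1]  # same data, different row order

train_1, test_1 = seeded_split(rows, 0.7, 313)
train_2, test_2 = seeded_split(rows, 0.7, 313)
train_3, test_3 = seeded_split(reordered, 0.7, 313)

# Same seed and same row order: the split is reproducible.
print(train_1 == train_2)
# Same seed, different row order: the selected rows may differ.
print(sorted(train_1), sorted(train_3))
```

If something similar applies here, comparing the row counts and contents of trainingData and testData across machines, before any model is fitted, would show whether the difference comes from the split or from the forest.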

          People

            Assignee: Unassigned
            Reporter: Martin Skauen (mskauen1)
            Votes: 0
            Watchers: 2