Spark / SPARK-16832

CrossValidator and TrainValidationSplit are not random without seed

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      Repeatedly running CrossValidator or TrainValidationSplit without an explicit seed parameter produces the same results every time. The seed is supposed to default to a random value, but it appears to default to some constant instead. (If a seed is explicitly provided, the two classes behave as expected.)

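      # Assumes an active SparkSession named `spark`, as in the pyspark shell.
      import pyspark.ml.evaluation
      import pyspark.ml.regression
      import pyspark.ml.tuning
      from pyspark.ml.linalg import Vectors

      dataset = spark.createDataFrame(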
        [(Vectors.dense([0.0]), 0.0),
         (Vectors.dense([0.4]), 1.0),
         (Vectors.dense([0.5]), 0.0),
         (Vectors.dense([0.6]), 1.0),
         (Vectors.dense([1.0]), 1.0)] * 1000,
        ["features", "label"]).cache()
      
      paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
      tvs = pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), 
                                 estimatorParamMaps=paramGrid,
                                 evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
                                 trainRatio=0.8)
      model = tvs.fit(dataset)
      print(model.validationMetrics)
      
      for folds in (3, 5, 10):
        cv = pyspark.ml.tuning.CrossValidator(estimator=pyspark.ml.regression.LinearRegression(), 
                                            estimatorParamMaps=paramGrid, 
                                            evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
                                            numFolds=folds
                                           )
        cvModel = cv.fit(dataset)
        print(folds, cvModel.avgMetrics)
      

      This code produces identical results upon repeated calls.
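
Since the description notes that both classes behave as expected when a seed is provided explicitly, one possible workaround is to draw a fresh seed for every run and pass it in. The following is only a sketch under that assumption: `fresh_seed` and `build_tuners` are hypothetical helper names, not part of PySpark, and the 32-bit seed range is a conservative choice rather than a documented limit.

```python
import random


def fresh_seed():
    """Return a new non-negative seed that fits in a signed 32-bit integer."""
    # SystemRandom draws from the OS entropy source, so the value differs
    # between runs regardless of any default seed used elsewhere.
    return random.SystemRandom().randrange(2**31)


def build_tuners(param_grid, num_folds=3):
    """Build both tuners with explicit, freshly drawn seeds (hypothetical helper).

    pyspark is imported lazily so fresh_seed() can be used without Spark.
    """
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, TrainValidationSplit

    tvs = TrainValidationSplit(estimator=LinearRegression(),
                               estimatorParamMaps=param_grid,
                               evaluator=RegressionEvaluator(),
                               trainRatio=0.8,
                               seed=fresh_seed())
    cv = CrossValidator(estimator=LinearRegression(),
                        estimatorParamMaps=param_grid,
                        evaluator=RegressionEvaluator(),
                        numFolds=num_folds,
                        seed=fresh_seed())
    return tvs, cv
```

With the snippet from the description, this would be used as `tvs, cv = build_tuners(paramGrid)` followed by `tvs.fit(dataset)` and `cv.fit(dataset)`; each call to `build_tuners` should then split the data differently.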

            People

              Assignee: Unassigned
              Reporter: Max Moroz (mmoroz)
              Votes: 0
              Watchers: 5