Description
Repeatedly running CrossValidator or TrainValidationSplit without an explicit seed parameter produces identical results on every run. The seed is supposed to default to a random value, but it appears to default to some constant instead. (When a seed is provided explicitly, both classes behave as expected.)
import pyspark.ml.evaluation
import pyspark.ml.regression
import pyspark.ml.tuning
from pyspark.ml.linalg import Vectors

# Assumes a SparkSession named `spark` (e.g. the pyspark shell).
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 1000,
    ["features", "label"]).cache()

paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()

tvs = pyspark.ml.tuning.TrainValidationSplit(
    estimator=pyspark.ml.regression.LinearRegression(),
    estimatorParamMaps=paramGrid,
    evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
    trainRatio=0.8)
model = tvs.fit(dataset)
print(model.validationMetrics)

for folds in (3, 5, 10):
    cv = pyspark.ml.tuning.CrossValidator(
        estimator=pyspark.ml.regression.LinearRegression(),
        estimatorParamMaps=paramGrid,
        evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
        numFolds=folds)
    cvModel = cv.fit(dataset)
    print(folds, cvModel.avgMetrics)
This code produces identical results upon repeated calls.
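The effect can be illustrated with a plain-Python sketch (not Spark's implementation; `split` here is a hypothetical stand-in for the randomSplit behind TrainValidationSplit): a constant default seed reproduces the same train/validation split on every run, whereas drawing a fresh seed per run, as the docs imply, would vary it.

```python
import random

def split(data, train_ratio, seed):
    """Shuffle-and-split analogy for TrainValidationSplit's partitioning."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))

# Constant default seed (the reported behavior): every run yields the
# same split, so repeated fits produce identical metrics.
a, _ = split(data, 0.8, seed=42)
b, _ = split(data, 0.8, seed=42)
assert a == b

# Fresh random seed per run (the expected behavior): the split, and
# hence the metrics, would generally differ between runs.
c, _ = split(data, 0.8, seed=random.randrange(1 << 30))
```

Until the default is fixed, passing `seed=random.randrange(1 << 30)` (or similar) to CrossValidator/TrainValidationSplit restores run-to-run variation.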
Issue Links
- relates to SPARK-17311 Standardize Python-Java MLlib API to accept optional long seeds in all cases (Resolved)