Description
When a CrossValidatorModel trained with more than 3 folds is saved and loaded again, a different number of subModels is returned: it seems exactly 3 subModels come back every time.
With fewer folds than the default of 3 (i.e. numFolds=2), writing fails outright.
The issue seems to be the following (though I am not that familiar with the Scala/Java side):
- the Python model is converted to its Scala/Java counterpart
- on the Scala side, subModels are saved for fold indices 0 until numFolds:
val subModelsPath = new Path(path, "subModels")
// loops over getNumFolds (default 3), not over the actual length of subModels
for (splitIndex <- 0 until instance.getNumFolds) {
  val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
  for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
    val modelPath = new Path(splitPath, paramIndex.toString).toString
    instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
  }
}
- numFolds is not available on the CrossValidatorModel in PySpark, so it is not transferred during the Python-to-Java conversion
- the default numFolds is 3, so the writer always tries to save subModels for exactly 3 folds (see the diagnostic sketch after the first test below)
The first issue can be reproduced with the following failing test, where spark is a SparkSession and tmp_path is a temporary directory (e.g. the corresponding pytest fixtures).
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors


def test_save_load_cross_validator(spark, tmp_path):
    temp_path = str(tmp_path)
    dataset = spark.createDataFrame(
        [
            (Vectors.dense([0.0]), 0.0),
            (Vectors.dense([0.4]), 1.0),
            (Vectors.dense([0.5]), 0.0),
            (Vectors.dense([0.6]), 1.0),
            (Vectors.dense([1.0]), 1.0),
        ] * 10,
        ["features", "label"],
    )
    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
    evaluator = BinaryClassificationEvaluator()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        collectSubModels=True,
        numFolds=4,
    )
    cvModel = cv.fit(dataset)

    # test save/load of CrossValidatorModel
    cvModelPath = temp_path + "/cvModel"
    cvModel.write().overwrite().save(cvModelPath)
    loadedModel = CrossValidatorModel.load(cvModelPath)
    assert len(loadedModel.subModels) == len(cvModel.subModels)
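The suspected root cause can be checked directly on the converted Java model. This is only a diagnostic sketch: it relies on the private _to_java() helper, so details may vary between versions.

# With cvModel fitted as in the test above (numFolds=4), the converted
# Java model still reports the Java-side default of 3, which is the
# bound the Scala writer loops over.
java_model = cvModel._to_java()
assert java_model.getNumFolds() == 3  # suspected root cause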
The second issue can be reproduced as follows (the write itself fails):
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors


def test_save_load_cross_validator(spark, tmp_path):
    temp_path = str(tmp_path)
    dataset = spark.createDataFrame(
        [
            (Vectors.dense([0.0]), 0.0),
            (Vectors.dense([0.4]), 1.0),
            (Vectors.dense([0.5]), 0.0),
            (Vectors.dense([0.6]), 1.0),
            (Vectors.dense([1.0]), 1.0),
        ] * 10,
        ["features", "label"],
    )
    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
    evaluator = BinaryClassificationEvaluator()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        collectSubModels=True,
        numFolds=2,
    )
    cvModel = cv.fit(dataset)

    # test save/load of CrossValidatorModel
    cvModelPath = temp_path + "/cvModel"
    cvModel.write().overwrite().save(cvModelPath)
    loadedModel = CrossValidatorModel.load(cvModelPath)
    assert len(loadedModel.subModels) == len(cvModel.subModels)
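Until this is fixed, a possible user-side workaround is to set numFolds on the Java model by hand before saving. This is a hedged, untested sketch that leans on PySpark/py4j internals (_to_java, Param.w, Params.set), not on a supported API:

# Hedged workaround sketch, with cv and cvModel as in the tests above:
# convert to the Java model, set numFolds explicitly, then save through
# the Java writer so the correct number of folds is persisted.
java_model = cvModel._to_java()
num_folds_param = java_model.getParam("numFolds")
java_model.set(num_folds_param.w(cv.getNumFolds()))
java_model.write().overwrite().save(cvModelPath)

loadedModel = CrossValidatorModel.load(cvModelPath)
assert len(loadedModel.subModels) == len(cvModel.subModels)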