Description
When a CrossValidatorModel trained with more than 3 folds is saved and loaded again, a different number of subModels is returned: it seems exactly 3 subModels come back every time.
With fewer folds than the default of 3 (i.e. numFolds=2), writing fails outright.
The issue seems to be the following (though I am not that familiar with the Scala/Java side):
- the Python model is converted to its Scala/Java counterpart
- on the Scala side, subModels are saved for fold indices 0 until numFolds:
val subModelsPath = new Path(path, "subModels")
// loops over getNumFolds (default 3), not over the actual length of subModels
for (splitIndex <- 0 until instance.getNumFolds) {
  val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
  for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
    val modelPath = new Path(splitPath, paramIndex.toString).toString
    instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
  }
}
- numFolds is not available on the CrossValidatorModel in PySpark, so it is not transferred during the Python-to-Java conversion
- the default numFolds is 3, so the writer always tries to save subModels for exactly 3 folds (see the diagnostic sketch after the first test below)
The first issue can be reproduced with the following failing test, where spark is a SparkSession and tmp_path is a temporary directory (e.g. the corresponding pytest fixtures).
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors


def test_save_load_cross_validator(spark, tmp_path):
    temp_path = str(tmp_path)
    dataset = spark.createDataFrame(
        [
            (Vectors.dense([0.0]), 0.0),
            (Vectors.dense([0.4]), 1.0),
            (Vectors.dense([0.5]), 0.0),
            (Vectors.dense([0.6]), 1.0),
            (Vectors.dense([1.0]), 1.0),
        ] * 10,
        ["features", "label"],
    )
    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
    evaluator = BinaryClassificationEvaluator()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        collectSubModels=True,
        numFolds=4,
    )
    cvModel = cv.fit(dataset)

    # test save/load of CrossValidatorModel
    cvModelPath = temp_path + "/cvModel"
    cvModel.write().overwrite().save(cvModelPath)
    loadedModel = CrossValidatorModel.load(cvModelPath)
    assert len(loadedModel.subModels) == len(cvModel.subModels)
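The suspected root cause can be checked directly on the converted Java model. This is only a diagnostic sketch: it relies on the private _to_java() helper, so details may vary between versions.

# With cvModel fitted as in the test above (numFolds=4), the converted
# Java model still reports the Java-side default of 3, which is the
# bound the Scala writer loops over.
java_model = cvModel._to_java()
assert java_model.getNumFolds() == 3  # suspected root cause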
The second issue can be reproduced as follows (the write itself fails):
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors


def test_save_load_cross_validator(spark, tmp_path):
    temp_path = str(tmp_path)
    dataset = spark.createDataFrame(
        [
            (Vectors.dense([0.0]), 0.0),
            (Vectors.dense([0.4]), 1.0),
            (Vectors.dense([0.5]), 0.0),
            (Vectors.dense([0.6]), 1.0),
            (Vectors.dense([1.0]), 1.0),
        ] * 10,
        ["features", "label"],
    )
    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
    evaluator = BinaryClassificationEvaluator()
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        collectSubModels=True,
        numFolds=2,
    )
    cvModel = cv.fit(dataset)

    # test save/load of CrossValidatorModel
    cvModelPath = temp_path + "/cvModel"
    cvModel.write().overwrite().save(cvModelPath)
    loadedModel = CrossValidatorModel.load(cvModelPath)
    assert len(loadedModel.subModels) == len(cvModel.subModels)
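Until this is fixed, a possible user-side workaround is to set numFolds on the Java model by hand before saving. This is a hedged, untested sketch that leans on PySpark/py4j internals (_to_java, Param.w, Params.set), not on a supported API:

# Hedged workaround sketch, with cv and cvModel as in the tests above:
# convert to the Java model, set numFolds explicitly, then save through
# the Java writer so the correct number of folds is persisted.
java_model = cvModel._to_java()
num_folds_param = java_model.getParam("numFolds")
java_model.set(num_folds_param.w(cv.getNumFolds()))
java_model.write().overwrite().save(cvModelPath)

loadedModel = CrossValidatorModel.load(cvModelPath)
assert len(loadedModel.subModels) == len(cvModel.subModels)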