Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32092

CrossvalidatorModel does not save all submodels (it saves only 3)

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.5
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: ML, PySpark
    • Labels:
      None
    • Environment:

      Ran on two systems:

      • Local pyspark installation (Windows): spark 2.4.5
      • Spark 2.4.0 on a cluster

      Description

      When saving a CrossValidatorModel with more than 3 subModels and loading again, a different amount of subModels is returned. It seems every time 3 subModels are returned.

      With less than two submodels (so 2 folds) writing plainly fails.

      Issue seems to be (but I am not so familiar with the scala/java side)

      • python object is converted to scala/java
      • in scala we save subModels until numFolds:

       

      val subModelsPath = new Path(path, "subModels") 
             for (splitIndex <- 0 until instance.getNumFolds) {
                val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
                for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
                  val modelPath = new Path(splitPath, paramIndex.toString).toString
                  instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
                }
      
      • numFolds is not available on the CrossValidatorModel in pyspark
      • default numFolds is 3 so somehow it tries to save 3 subModels.

      The first issue can be reproduced by following failing tests, where spark is a SparkSession and tmp_path is a (temporary) directory.

      from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.linalg import Vectors
      
      
      def test_save_load_cross_validator(spark, tmp_path):
          temp_path = str(tmp_path)
          dataset = spark.createDataFrame(
              [
                  (Vectors.dense([0.0]), 0.0),
                  (Vectors.dense([0.4]), 1.0),
                  (Vectors.dense([0.5]), 0.0),
                  (Vectors.dense([0.6]), 1.0),
                  (Vectors.dense([1.0]), 1.0),
              ]
              * 10,
              ["features", "label"],
          )
      
          lr = LogisticRegression()
      
          grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
      
          evaluator = BinaryClassificationEvaluator()
      
          cv = CrossValidator(
              estimator=lr,
              estimatorParamMaps=grid,
              evaluator=evaluator,
              collectSubModels=True,
              numFolds=4,
          )
      
          cvModel = cv.fit(dataset)
      
          # test save/load of CrossValidatorModel
      
          cvModelPath = temp_path + "/cvModel"
      
          cvModel.write().overwrite().save(cvModelPath)
      
          loadedModel = CrossValidatorModel.load(cvModelPath)
          assert len(loadedModel.subModels) == len(cvModel.subModels)
      

       
      The second as follows (will fail writing):

      from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.linalg import Vectors
      
      
      def test_save_load_cross_validator(spark, tmp_path):
          temp_path = str(tmp_path)
          dataset = spark.createDataFrame(
              [
                  (Vectors.dense([0.0]), 0.0),
                  (Vectors.dense([0.4]), 1.0),
                  (Vectors.dense([0.5]), 0.0),
                  (Vectors.dense([0.6]), 1.0),
                  (Vectors.dense([1.0]), 1.0),
              ]
              * 10,
              ["features", "label"],
          )
      
          lr = LogisticRegression()
      
          grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
      
          evaluator = BinaryClassificationEvaluator()
      
          cv = CrossValidator(
              estimator=lr,
              estimatorParamMaps=grid,
              evaluator=evaluator,
              collectSubModels=True,
              numFolds=2,
          )
      
          cvModel = cv.fit(dataset)
      
          # test save/load of CrossValidatorModel
      
          cvModelPath = temp_path + "/cvModel"
      
          cvModel.write().overwrite().save(cvModelPath)
      
          loadedModel = CrossValidatorModel.load(cvModelPath)
          assert len(loadedModel.subModels) == len(cvModel.subModels)
      

       
       

        Attachments

          Activity

            People

            • Assignee:
              xzrspark Zirui Xu
              Reporter:
              ADR An De Rijdt
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: