SPARK-25959: Difference in featureImportances results on computed vs saved models

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 3.0.0
    • Component/s: ML, MLlib
    • Labels: None

      Description

      I tried to fit a GBT model and found that the featureImportances computed while the model was fit differ from those of the same model after it is saved to storage and loaded back.

      I also found that once the persisted model is loaded, saved again, and loaded once more, the feature importances remain the same.

      I am not sure whether this is a bug in storing and reading the model the first time, or whether I am missing some parameter that needs to be set before saving the model (so the model picks up some defaults, causing the feature importances to change).

      Below is the test code:

      import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
      import org.apache.spark.ml.feature.VectorAssembler
      import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

      val testDF = Seq(
        (1, 3, 2, 1, 1),
        (3, 2, 1, 2, 0),
        (2, 2, 1, 1, 0),
        (3, 4, 2, 2, 0),
        (2, 2, 1, 3, 1)
      ).toDF("a", "b", "c", "d", "e")

      val featureColumns = testDF.columns.filter(_ != "e")
      // Assemble the features into a vector
      val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
      // Transform the data to get the feature data set
      val featureDF = assembler.transform(testDF)

      // Train a GBT model.
      val gbt = new GBTClassifier()
        .setLabelCol("e")
        .setFeaturesCol("features")
        .setMaxDepth(2)
        .setMaxBins(5)
        .setMaxIter(10)
        .setSeed(10)
        .fit(featureDF)

      gbt.transform(featureDF).show(false)

      // Print the feature importances of the freshly fit model
      featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
      /* Prints

      (d,0.5931875075767403)
      (a,0.3747184548362353)
      (b,0.03209403758702444)
      (c,0.0)

      */
      // Write out the model
      gbt.write.overwrite().save("file:///tmp/test123")

      println("Reading model again")
      val gbtload = GBTClassificationModel.load("file:///tmp/test123")

      featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

      /* Prints

      (d,0.6455841215290767)
      (a,0.3316126797964181)
      (b,0.022803198674505094)
      (c,0.0)

      */

      gbtload.write.overwrite().save("file:///tmp/test123_rewrite")

      val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")

      featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)

      /* Prints
      (d,0.6455841215290767)
      (a,0.3316126797964181)
      (b,0.022803198674505094)
      (c,0.0)

      */
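
      As an aside (not part of the original report), here is a minimal sketch of how the mismatch can be checked programmatically, assuming the gbt and gbtload models defined above are still in scope:

      // Compare the fit-time and reloaded importances element-wise.
      val before = gbt.featureImportances.toArray
      val after = gbtload.featureImportances.toArray
      featureColumns.zip(before.zip(after)).foreach { case (name, (b, a)) =>
        println(f"$name%s: fit=$b%.6f loaded=$a%.6f |diff|=${math.abs(b - a)}%.6f")
      }
      // With the outputs above, feature "d" differs by roughly 0.05, so this assertion fails.
      assert(before.zip(after).forall { case (b, a) => math.abs(b - a) < 1e-9 },
        "featureImportances changed after save/load")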

            People

            • Assignee: Marco Gaido (mgaido)
            • Reporter: Suraj Nayak (snayakm)