SPARK-16750: ML GaussianMixture training failed due to a feature column type mistake

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: ML
    • Labels: None

    Description

      ML GaussianMixture training failed due to a feature column type mistake: the schema validation requires the features column to be of type mllib.linalg.VectorUDT, when it should require ml.linalg.VectorUDT (the type that ML transformers actually produce).
      This bug is easy to reproduce with the following code:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.clustering.GaussianMixture
      import org.apache.spark.ml.feature.MinMaxScaler
      import org.apache.spark.ml.linalg.Vectors

      val df = spark.createDataFrame(
        Seq(
          (1, Vectors.dense(0.0, 1.0, 4.0)),
          (2, Vectors.dense(1.0, 0.0, 4.0)),
          (3, Vectors.dense(1.0, 0.0, 5.0)),
          (4, Vectors.dense(0.0, 0.0, 5.0)))
      ).toDF("id", "features")
      
      val scaler = new MinMaxScaler()
        .setInputCol("features")
        .setOutputCol("features_scaled")
        .setMin(0.0)
        .setMax(5.0)
      
      val gmm = new GaussianMixture()
        .setFeaturesCol("features_scaled")
        .setK(2)
      
      val pipeline = new Pipeline().setStages(Array(scaler, gmm))
      pipeline.fit(df)
      
      requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
      java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
      	at scala.Predef$.require(Predef.scala:224)
      	at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
      	at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
      	at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
      	at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
      	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
      	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
      	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
      	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
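
      Per the trace, the failing check is SchemaUtils.checkColumnType called from GaussianMixtureParams.validateAndTransformSchema (GaussianMixture.scala:64), which passes the old mllib.linalg.VectorUDT as the required type. Below is a minimal self-contained sketch of the corrected validation, using the public SQLDataTypes.VectorType alias for the new ml.linalg vector UDT; the function name is illustrative, not the actual patch:

      import org.apache.spark.ml.linalg.SQLDataTypes
      import org.apache.spark.sql.types.StructType

      // Hypothetical standalone version of the corrected check: the features
      // column must carry the new ml.linalg vector UDT (exposed publicly as
      // SQLDataTypes.VectorType), not the old mllib.linalg one.
      def checkFeaturesColumn(schema: StructType, featuresCol: String): Unit = {
        val actual = schema(featuresCol).dataType
        require(actual == SQLDataTypes.VectorType,
          s"Column $featuresCol must be of type ${SQLDataTypes.VectorType} " +
            s"but was actually $actual.")
      }

      With the required type corrected this way, the pipeline above fits cleanly, since MinMaxScaler outputs ml.linalg vectors.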
      

      Why did the unit tests not catch this error? Because some estimators/transformers never call transformSchema(dataset.schema) first during fit or transform, so the schema validation is skipped on those code paths. I will also add this call to all estimators/transformers that are missing it.
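
      A minimal sketch of that guard, assuming a typical Spark ML estimator; the override below is an illustrative fragment of what such a call looks like inside fit, not the actual patch:

      import org.apache.spark.sql.Dataset

      // Sketch: validate the input schema up front, so a bad features column
      // fails fast even when fit() is called outside a Pipeline.
      override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
        transformSchema(dataset.schema, logging = true) // runs validateAndTransformSchema
        // ... training logic unchanged ...
      }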

          People

            Assignee: Yanbo Liang (yanboliang)
            Reporter: Yanbo Liang (yanboliang)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: