Description
ML GaussianMixture training fails because of a feature column type mistake: the feature column type should be ml.linalg.VectorUDT, but mllib.linalg.VectorUDT is checked by mistake.
This bug is easy to reproduce with the following code:
val df = spark.createDataFrame(
  Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)

val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)

val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)

This fails with:

java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
  at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
  at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
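The stack trace points at the schema check in GaussianMixtureParams.validateAndTransformSchema (GaussianMixture.scala:64), which validates the features column against the old mllib vector type. A sketch of the fix follows; the method body around the check is simplified here and is not a verbatim patch:

```scala
import org.apache.spark.ml.linalg.VectorUDT  // the new ml vector type (correct)
import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types.{IntegerType, StructType}

protected def validateAndTransformSchema(schema: StructType): StructType = {
  // Bug: this check previously used org.apache.spark.mllib.linalg.VectorUDT,
  // so any ml.linalg vector column (e.g. MinMaxScaler's output) was rejected.
  SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
  SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
}
```

With only the import changed from mllib.linalg to ml.linalg, the pipeline above fits without error.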
Why did the unit tests not catch this error? Because some estimators/transformers do not call transformSchema(dataset.schema) first during fit or transform. I will also add this call to the estimators/transformers that are missing it.
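The hardening described above can be sketched as validating the schema up front in fit, following the pattern other ml estimators already use (a sketch, not a verbatim patch; the training body is elided):

```scala
// Pattern: validate the input schema before doing any work, so a schema
// mismatch fails fast with a clear message instead of surfacing mid-training.
override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
  transformSchema(dataset.schema, logging = true)  // schema check first
  // ... actual training follows ...
}
```

Because transformSchema calls validateAndTransformSchema, adding this call to every estimator/transformer means unit tests exercise the schema checks directly and would have flagged the wrong VectorUDT.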