Description
ML GaussianMixture training fails because of a feature column type mistake: the feature column type should be ml.linalg.VectorUDT, but mllib.linalg.VectorUDT is checked by mistake.
This bug is easy to reproduce with the following code:
val df = spark.createDataFrame(
  Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)

val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)

val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)

This fails with:

java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
  at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
  at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
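The stack trace points at the schema check in GaussianMixtureParams.validateAndTransformSchema (GaussianMixture.scala:64), which validates the features column against the old mllib vector type. A sketch of the fix follows; the method body around the check is simplified here and is not a verbatim patch:

```scala
import org.apache.spark.ml.linalg.VectorUDT  // the new ml vector type (correct)
import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types.{IntegerType, StructType}

protected def validateAndTransformSchema(schema: StructType): StructType = {
  // Bug: this check previously used org.apache.spark.mllib.linalg.VectorUDT,
  // so any ml.linalg vector column (e.g. MinMaxScaler's output) was rejected.
  SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
  SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
}
```

With only the import changed from mllib.linalg to ml.linalg, the pipeline above fits without error.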
Why did the unit tests not catch this error? Because some estimators/transformers do not call transformSchema(dataset.schema) first during fit or transform. I will also add this call to the estimators/transformers that are missing it.
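The hardening described above can be sketched as validating the schema up front in fit, following the pattern other ml estimators already use (a sketch, not a verbatim patch; the training body is elided):

```scala
// Pattern: validate the input schema before doing any work, so a schema
// mismatch fails fast with a clear message instead of surfacing mid-training.
override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
  transformSchema(dataset.schema, logging = true)  // schema check first
  // ... actual training follows ...
}
```

Because transformSchema calls validateAndTransformSchema, adding this call to every estimator/transformer means unit tests exercise the schema checks directly and would have flagged the wrong VectorUDT.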