Description
1. Input vector validation is missing in most algorithms. When the input dataset contains invalid values (NaN/Infinity):
- the training may run successfully and return a model containing invalid coefficients, e.g. LinearSVC;
- the training may fail with an irrelevant error message, e.g. KMeans.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._

val df = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
We should make the ML algorithms fail fast if the input dataset is invalid.
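As a rough illustration of the intended fail-fast behavior, here is a minimal sketch of an eager check that rejects NaN/Infinity features before any training work; the helper name validateFeatures and its placement are assumptions for illustration, not an existing Spark API:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: eagerly reject datasets whose features column
// contains null, NaN or Infinity values, before any iterations start.
def validateFeatures(df: DataFrame, featuresCol: String = "features"): Unit = {
  val hasInvalid = udf { v: Vector =>
    v == null || v.toArray.exists(x => x.isNaN || x.isInfinite)
  }
  val numInvalid = df.filter(hasInvalid(col(featuresCol))).count()
  require(numInvalid == 0,
    s"Column '$featuresCol' contains $numInvalid row(s) with null/NaN/Infinity values")
}

With such a check, validateFeatures(df) on the dataset above would throw a clear error instead of producing NaN coefficients or an obscure norm failure inside KMeans.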
2. There already exist several methods to validate input labels and weights, scattered across different files:
- org.apache.spark.ml.functions
- org.apache.spark.ml.util.DatasetUtils
- org.apache.spark.ml.util.MetadataUtils
- org.apache.spark.ml.Predictor
- etc.
I think it is time to unify these related methods into one source file.
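One possible shape for such a unified helper is sketched below; the object and method names are illustrative assumptions about the design, not the final API. The idea is a single place exposing Column-level checks that return the column unchanged for valid rows and raise an error otherwise, so every algorithm validates input the same way:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, raise_error, udf, when}

// Illustrative sketch only; names and placement are assumptions.
object ValidationUtils {

  // Returns the vector column unchanged, or raises an error on any row
  // containing null/NaN/Infinity, so trainers fail fast with a clear message.
  def checkNonNanVectors(vectorCol: Column): Column = {
    val isValid = udf { v: Vector =>
      v != null && v.toArray.forall(x => !x.isNaN && !x.isInfinite)
    }
    when(isValid(vectorCol), vectorCol)
      .otherwise(raise_error(lit("Vector values must be finite and non-NaN")))
  }

  // Same idea for instance weights: non-negative, non-NaN, finite.
  def checkNonNegativeWeights(weightCol: Column): Column = {
    when(weightCol >= 0 && !weightCol.isNaN && weightCol < Double.PositiveInfinity, weightCol)
      .otherwise(raise_error(lit("Weights must be non-negative and finite")))
  }
}

An algorithm could then wrap its features and weight columns with these checks when extracting training instances, so an invalid row fails the job at that point instead of silently corrupting the model.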
Sub-tasks
1. Validate input dataset of ml.classification | Resolved | Unassigned
2. Validate input dataset of ml.regression | Resolved | Ruifeng Zheng
3. Validate input dataset of ml.clustering | Resolved | Ruifeng Zheng
4. cleanup validation functions | Resolved | Ruifeng Zheng