Description
1. Input vector validation is missing in most algorithms. When the input dataset contains invalid values (NaN/Infinity):
- the training may run successfully and return a model containing invalid coefficients, e.g. LinearSVC;
- the training may fail with an irrelevant error message, e.g. KMeans.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.clustering._

val df = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
  LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()

val svc = new LinearSVC()
val model = svc.fit(df)

scala> model.intercept
res0: Double = NaN

scala> model.coefficients
res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]

val km = new KMeans().setK(2)
scala> km.fit(df)
22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
We should make the ML algorithms fail fast if the input dataset is invalid.
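As a rough illustration of the intended fail-fast behavior, here is a minimal sketch of an eager check that rejects NaN/Infinity features before any training work; the helper name validateFeatures and its placement are assumptions for illustration, not an existing Spark API:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: eagerly reject datasets whose features column
// contains null, NaN or Infinity values, before any iterations start.
def validateFeatures(df: DataFrame, featuresCol: String = "features"): Unit = {
  val hasInvalid = udf { v: Vector =>
    v == null || v.toArray.exists(x => x.isNaN || x.isInfinite)
  }
  val numInvalid = df.filter(hasInvalid(col(featuresCol))).count()
  require(numInvalid == 0,
    s"Column '$featuresCol' contains $numInvalid row(s) with null/NaN/Infinity values")
}

With such a check, validateFeatures(df) on the dataset above would throw a clear error instead of producing NaN coefficients or an obscure norm failure inside KMeans.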
2. There already exist several methods to validate input labels and weights, scattered across different files:
- org.apache.spark.ml.functions
- org.apache.spark.ml.util.DatasetUtils
- org.apache.spark.ml.util.MetadataUtils
- org.apache.spark.ml.Predictor
- etc.
I think it is time to unify these related methods into one source file.
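One possible shape for such a unified helper is sketched below; the object and method names are illustrative assumptions about the design, not the final API. The idea is a single place exposing Column-level checks that return the column unchanged for valid rows and raise an error otherwise, so every algorithm validates input the same way:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, raise_error, udf, when}

// Illustrative sketch only; names and placement are assumptions.
object ValidationUtils {

  // Returns the vector column unchanged, or raises an error on any row
  // containing null/NaN/Infinity, so trainers fail fast with a clear message.
  def checkNonNanVectors(vectorCol: Column): Column = {
    val isValid = udf { v: Vector =>
      v != null && v.toArray.forall(x => !x.isNaN && !x.isInfinite)
    }
    when(isValid(vectorCol), vectorCol)
      .otherwise(raise_error(lit("Vector values must be finite and non-NaN")))
  }

  // Same idea for instance weights: non-negative, non-NaN, finite.
  def checkNonNegativeWeights(weightCol: Column): Column = {
    when(weightCol >= 0 && !weightCol.isNaN && weightCol < Double.PositiveInfinity, weightCol)
      .otherwise(raise_error(lit("Weights must be non-negative and finite")))
  }
}

An algorithm could then wrap its features and weight columns with these checks when extracting training instances, so an invalid row fails the job at that point instead of silently corrupting the model.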
Sub-tasks
1. Validate input dataset of ml.classification | Resolved | Unassigned
2. Validate input dataset of ml.regression | Resolved | Ruifeng Zheng
3. Validate input dataset of ml.clustering | Resolved | Ruifeng Zheng
4. cleanup validation functions | Resolved | Ruifeng Zheng