Spark / SPARK-38584

Unify the data validation

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: ML
    • Labels: None

    Description

      1. Input vector validation is missing in most algorithms. When the input dataset contains invalid values (NaN/Infinity):

      • training may run successfully and return a model containing invalid coefficients, as in LinearSVC
      • training may fail with an irrelevant error message, as in KMeans

       

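      // Run in spark-shell, where `sc` and the implicits needed by toDF() are already in scope.
      import org.apache.spark.ml.feature._
      import org.apache.spark.ml.linalg._
      import org.apache.spark.ml.classification._
      import org.apache.spark.ml.clustering._

      // Two rows whose feature vectors contain NaN and Infinity.
      val df = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
        LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0))
      )).toDF()

      // LinearSVC trains without error, but the resulting model is unusable.
      val svc = new LinearSVC()
      val model = svc.fit(df)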
      
      scala> model.intercept
      res0: Double = NaN
      
      scala> model.coefficients
      res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
      
      // KMeans fails, but the message does not point at the invalid input.
      val km = new KMeans().setK(2)
      scala> km.fit(df)
      22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
      java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
          at scala.Predef$.require(Predef.scala:281)
          at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
      

       

      We should make ML algorithms fail fast when the input dataset is invalid.
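
      A minimal sketch of what failing fast could look like, written in plain Scala so it runs without Spark. The object and method names (VectorValidationSketch, firstInvalidIndex, requireFinite) and the error wording are hypothetical and are not taken from Spark's code base; in a Spark job the same check could be applied to the array returned by a vector's toArray before training starts.

```scala
// Hypothetical helper for illustration only; not part of Spark's API.
object VectorValidationSketch {

  /** Index of the first NaN or infinite entry, if there is one. */
  def firstInvalidIndex(values: Array[Double]): Option[Int] = {
    val i = values.indexWhere(v => v.isNaN || v.isInfinity)
    if (i >= 0) Some(i) else None
  }

  /** Throws IllegalArgumentException naming the offending entry; otherwise returns the input unchanged. */
  def requireFinite(values: Array[Double]): Array[Double] = {
    firstInvalidIndex(values).foreach { i =>
      throw new IllegalArgumentException(
        s"Invalid value ${values(i)} at index $i: vector entries must not be NaN or Infinity")
    }
    values
  }
}
```

      With a check like this in front of training, both rows of the example dataset above would be rejected with a message that names the invalid entry, instead of producing NaN coefficients or an unrelated error about norms.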

       

      2. Methods for validating input labels and weights are scattered across different files:

      • org.apache.spark.ml.functions
      • org.apache.spark.ml.util.DatasetUtils
      • org.apache.spark.ml.util.MetadataUtils
      • org.apache.spark.ml.Predictor
      • etc.

       

      I think it is time to unify the related methods in one source file.
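
      As a rough illustration of keeping the rules in one place, the sketch below collects label and weight checks in a single object. Everything here is an assumption made for the example: the object name ValidationSketch, the method names, and the rules themselves (finite labels, finite non-negative weights) are hypothetical and do not describe the methods in the files listed above.

```scala
// Hypothetical single home for validation rules; names and rules are assumptions, not Spark's API.
object ValidationSketch {

  private def isFinite(v: Double): Boolean = !v.isNaN && !v.isInfinity

  /** Assumed rule: a label must be finite. */
  def checkLabel(label: Double): Either[String, Double] =
    if (isFinite(label)) Right(label)
    else Left(s"Labels must not be NaN or Infinity, but got $label")

  /** Assumed rule: a weight must be finite and non-negative. */
  def checkWeight(weight: Double): Either[String, Double] =
    if (isFinite(weight) && weight >= 0.0) Right(weight)
    else Left(s"Weights must be finite and non-negative, but got $weight")
}
```

      Returning Either keeps the rules testable on their own; a caller that wants to fail fast can turn a Left into an exception at the point where the dataset is first read.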

       

          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)