  1. Spark
  2. SPARK-38584 Unify the data validation
  3. SPARK-38588

Validate input dataset of ml.classification

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Resolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: ML
    • Labels: None

    Description

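      LinearSVC should fail fast if the input dataset contains invalid values such as NaN or infinity. Currently, fitting on such data succeeds and returns a model whose intercept and coefficients are NaN: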

       

      import org.apache.spark.ml.feature._
      import org.apache.spark.ml.linalg._
      import org.apache.spark.ml.classification._

      // Two rows whose feature vectors contain NaN and +Infinity.
      val df = sc.parallelize(Seq(
        LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
        LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0))
      )).toDF()

      // fit() completes without rejecting the invalid input.
      val svc = new LinearSVC()
      val model = svc.fit(df)

      scala> model.intercept
      res0: Double = NaN

      scala> model.coefficients
      res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
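
For illustration only, a caller-side pre-check along the following lines could give fail-fast behaviour in a spark-shell session. This is a minimal sketch, not the change made for this issue: `requireFiniteInput` is a hypothetical helper that does not exist in Spark, and the column names assume the `label` and `features` defaults used by LinearSVC and produced by `LabeledPoint`.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper (not part of Spark): throws if any row has a
// NaN or infinite label, or a feature vector containing such a value.
def requireFiniteInput(
    df: DataFrame,
    labelCol: String = "label",
    featuresCol: String = "features"): Unit = {
  val finiteLabel = udf((d: Double) => !d.isNaN && !d.isInfinity)
  val finiteVector = udf((v: Vector) =>
    v.toArray.forall(x => !x.isNaN && !x.isInfinity))

  // Fetch at most one offending row; an empty result means the data is clean.
  val offending = df
    .filter(!finiteLabel(col(labelCol)) || !finiteVector(col(featuresCol)))
    .head(1)

  require(offending.isEmpty,
    s"Input contains NaN or infinite values, e.g. ${offending.headOption}")
}
```

With this sketch, calling `requireFiniteInput(df)` on the dataset above would throw an `IllegalArgumentException` before `svc.fit(df)` is reached.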

          People

            Assignee: Unassigned
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: