  1. Spark
  2. SPARK-21690

one-pass imputer

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Resolved
    • 2.2.1
    • None
    • ML
    • None

    Description

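          // One iteration per input column: each iteration launches its own Spark jobs.
          val surrogates = $(inputCols).map { inputCol =>
            val ic = col(inputCol)
            // Keep only the values that can contribute to the surrogate.
            val filtered = dataset.select(ic.cast(DoubleType))
              .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
            // First job for this column: check that at least one valid value remains.
            if (filtered.take(1).length == 0) {
              throw new SparkException(s"surrogate cannot be computed. " +
                s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
            }
            // Second job for this column: compute the mean or the approximate median.
            val surrogate = $(strategy) match {
              case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
              case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
            }
            surrogate
          }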
      

      The current implementation of Imputer processes one column after another, so every input column triggers its own Spark jobs. We should instead compute the surrogates for all input columns together, in a single pass over the data.
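
      As an illustration only, the sketch below shows one way the mean strategy could be computed for all columns in a single aggregation. It is not the patch attached to this issue. The object name OnePassMeanSketch and the method meanSurrogates are hypothetical, and the median strategy is left out.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper, not part of Spark: a sketch of a one-pass mean surrogate.
object OnePassMeanSketch {

  def meanSurrogates(
      dataset: DataFrame,
      inputCols: Array[String],
      missingValue: Double): Map[String, Double] = {

    // Build one aggregate expression per column. `when` without `otherwise`
    // yields null for invalid values, and `avg` skips nulls.
    val aggCols = inputCols.map { name =>
      val ic = col(name).cast(DoubleType)
      avg(when(ic.isNotNull && ic =!= missingValue && !ic.isNaN, ic)).as(name)
    }

    // A single select with only aggregate expressions runs as one global
    // aggregation, so all columns are scanned together.
    val row = dataset.select(aggCols: _*).head()

    inputCols.zipWithIndex.map { case (name, i) =>
      // A null average means the column had no valid value at all.
      if (row.isNullAt(i)) {
        throw new SparkException(s"surrogate cannot be computed. " +
          s"All the values in $name are Null, Nan or missingValue($missingValue)")
      }
      name -> row.getDouble(i)
    }.toMap
  }
}
```

      With this shape, the emptiness check that previously needed a separate take(1) per column falls out of the aggregation result itself: a column with no valid values produces a null average.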

          People

            podongfeng Ruifeng Zheng
            podongfeng Ruifeng Zheng
            Yanbo Liang Yanbo Liang
            Votes: 0
            Watchers: 2