[SPARK-21690] one-pass imputer - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: ML
Labels: None

Description

    val surrogates = $(inputCols).map { inputCol =>
      val ic = col(inputCol)
      val filtered = dataset.select(ic.cast(DoubleType))
        .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
      if(filtered.take(1).length == 0) {
        throw new SparkException(s"surrogate cannot be computed. " +
          s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
      }
      val surrogate = $(strategy) match {
        case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
        case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
      }
      surrogate
    }

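The current implementation of Imputer processes one column after another: for each input column, the code above runs its own emptiness check and its own aggregation over the dataset. We should instead process the columns together in a more efficient way, ideally in a single pass.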
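
As a rough illustration of the idea for the mean strategy only, the per-column loop could be collapsed into one aggregation that computes every column's average in a single `select`. This is a minimal sketch and not necessarily what the linked pull request does: the object name `OnePassImputerSketch` and the helpers `cleaned` and `means` are hypothetical, and the median strategy is not covered here.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper object, for illustration only.
object OnePassImputerSketch {

  // Keep a value only if it is not null, not NaN and not the missing marker.
  // `when` without `otherwise` yields null for the rest, and avg() skips nulls.
  private def cleaned(name: String, missingValue: Double): Column = {
    val c = col(name).cast(DoubleType)
    when(c.isNotNull && !c.isNaN && c =!= missingValue, c)
  }

  // Compute the mean surrogate of every input column with one aggregation job.
  def means(
      dataset: DataFrame,
      inputCols: Array[String],
      missingValue: Double): Map[String, Double] = {
    val aggregates = inputCols.map(name => avg(cleaned(name, missingValue)))
    val row = dataset.select(aggregates: _*).head()

    inputCols.zipWithIndex.map { case (name, i) =>
      // avg() over zero valid values is null: nothing to impute with.
      if (row.isNullAt(i)) {
        throw new SparkException(s"surrogate cannot be computed. " +
          s"All the values in $name are Null, Nan or missingValue($missingValue)")
      }
      name -> row.getDouble(i)
    }.toMap
  }
}
```

With this shape, the "all values missing" check no longer needs a separate `take(1)` per column, because a null average already signals that no valid value was found.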

Issue Links

links to

https://github.com/apache/spark/pull/18902

People

Assignee: Ruifeng Zheng

Reporter: Ruifeng Zheng

Shepherd: Yanbo Liang

Votes: 0

Watchers: 2

Dates

Created: 10/Aug/17 06:06

Updated: 09/Oct/17 06:27

Resolved: 09/Oct/17 06:27