Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Resolved
2.2.1
None
None
Description
val surrogates = $(inputCols).map { inputCol =>
  val ic = col(inputCol)
  val filtered = dataset.select(ic.cast(DoubleType))
    .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
  if (filtered.take(1).length == 0) {
    throw new SparkException(s"surrogate cannot be computed. " +
      s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
  }
  val surrogate = $(strategy) match {
    case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
  }
  surrogate
}
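The current implementation of Imputer processes one column after another: for each input column it filters the dataset, checks that at least one valid value remains, and then computes the surrogate, so every column triggers its own Spark jobs. We should parallelize this processing so that the surrogates are computed more efficiently.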
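One possible direction, shown only as a minimal sketch and not as the change that was actually merged: for the mean strategy, the per-column filter can be folded into a conditional expression so that all averages are computed in a single aggregation. The object `ImputerSketch` and the helper `meanSurrogates` are hypothetical names introduced for illustration, and `missingValue` is passed as a plain `Double` instead of being read from the Imputer params.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper, not part of Spark's Imputer.
object ImputerSketch {

  // Computes the mean surrogate of every input column in one aggregation,
  // instead of running separate jobs for each column.
  def meanSurrogates(
      dataset: DataFrame,
      inputCols: Array[String],
      missingValue: Double): Array[Double] = {
    val aggs = inputCols.map { name =>
      val ic = col(name).cast(DoubleType)
      // `when` without `otherwise` yields null for invalid entries,
      // and `avg` ignores nulls, so this mirrors the original filter.
      avg(when(ic.isNotNull && ic =!= missingValue && !ic.isNaN, ic))
    }

    // A select that contains only aggregate expressions returns a single row.
    val row = dataset.select(aggs: _*).head()

    inputCols.zipWithIndex.map { case (name, i) =>
      // The average over zero valid values is null.
      if (row.isNullAt(i)) {
        throw new SparkException(s"surrogate cannot be computed. " +
          s"All the values in $name are Null, Nan or missingValue($missingValue)")
      }
      row.getDouble(i)
    }
  }
}
```

A call such as `ImputerSketch.meanSurrogates(df, Array("a", "b"), Double.NaN)` would return one surrogate per column, in the order of `inputCols`. The sketch covers only the mean strategy; the median strategy would need a different approach, for example the multi-column overload of `approxQuantile`, whose handling of null and NaN values should be checked against the Spark version in use.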