Missing values are common in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to, for example, Imputer in scikit-learn.
Initially, the imputation options could include mean, median and most frequent, with other approaches added later. Where possible, existing DataFrame code can be reused (e.g. for approximate quantiles).
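To make the three proposed strategies concrete, here is a minimal plain-Python sketch of what each one would compute for a single numeric column. It is only an illustration and not the Spark implementation: the function names (`is_missing`, `fit_surrogate`, `impute`), the strategy strings, and the choice to treat both None and NaN as missing are assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(value):
    """Treat None and NaN as missing (an assumption made for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_surrogate(values, strategy="mean"):
    """Compute the replacement value for one column from its observed entries."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "most_frequent":
        # Counter.most_common breaks ties by first occurrence.
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(values, surrogate):
    """Return a copy of the column with missing entries replaced by the surrogate."""
    return [surrogate if is_missing(v) else v for v in values]


# Hypothetical column with two missing entries.
column = [1.0, float("nan"), 1.0, None, 4.0, 10.0]
for strategy in ("mean", "median", "most_frequent"):
    surrogate = fit_surrogate(column, strategy)
    print(strategy, surrogate, impute(column, surrogate))
```

The sketch computes the surrogate in one step and applies it in another, so the same fitted value could be reused on new data. A DataFrame-based version would presumably compute the surrogates with distributed aggregations (an approximate quantile for the median, say) rather than collecting the column locally.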
- blocks
  - SPARK-15040 PySpark impl for ml.feature.Imputer (Resolved)
  - SPARK-15041 adding mode strategy for ml.feature.Imputer for categorical features (Resolved)
  - SPARK-19969 Doc and examples for Imputer (Resolved)
- relates to
  - SPARK-13639 Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors (Resolved)
- links to