Description
Missing values are common in data sets. It would be useful to implement a Transformer that imputes missing data points, similar to e.g. Imputer in scikit-learn.
Initially, the imputation options could include mean, median and most frequent, with other approaches added later. Where possible, existing DataFrame code can be reused (e.g. for approximate quantiles).
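To make the three proposed strategies concrete, here is a minimal plain-Python sketch of what each one computes for a single column. It is not Spark code and not the proposed API: the names `is_missing`, `fit_surrogate` and `impute` are hypothetical, the choice to treat both None and NaN as missing is an assumption, and the median here is exact rather than approximate.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(x):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fit_surrogate(values, strategy="mean"):
    """Compute the replacement value for one column from its observed entries."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values")
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "most_frequent":
        # Ties go to whichever value Counter lists first.
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(values, surrogate):
    """Replace every missing entry with the precomputed surrogate."""
    return [surrogate if is_missing(v) else v for v in values]


column = [1.0, float("nan"), 3.0, 3.0, None]
for strategy in ("mean", "median", "most_frequent"):
    surrogate = fit_surrogate(column, strategy)
    print(strategy, impute(column, surrogate))
```

Computing the surrogate separately from applying it means a value learned on one data set can be reused on another, which is the usual reason to split imputation into a fitting step and a transforming step.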
Issue Links
- blocks
  - SPARK-15040 PySpark impl for ml.feature.Imputer (Resolved)
  - SPARK-15041 adding mode strategy for ml.feature.Imputer for categorical features (Resolved)
  - SPARK-19969 Doc and examples for Imputer (Resolved)
- relates to
  - SPARK-13639 Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors (Resolved)
- links to