[SPARK-13568] Create feature transformer to impute missing values - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.2.0
Component/s: ML
Labels: None

Description

It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.

Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches. Where possible, existing DataFrame code can be reused (e.g. for approximate quantiles).
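
To make the three proposed strategies concrete, here is a minimal plain-Python sketch of what each would compute for a single numeric column. It is only an illustration and not Spark's implementation or API: the function name `impute`, the strategy strings, and the treatment of both `None` and NaN as missing are assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace missing entries (None or NaN) with a statistic of the observed ones.

    Hypothetical helper for illustration; not part of Spark or scikit-learn.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # The fill value is computed from the non-missing entries only.
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")

    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        # Ties are broken in favour of the value seen first.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return [fill if is_missing(v) else v for v in values]


column = [1.0, None, 2.0, float("nan"), 2.0, 7.0]
print(impute(column, "mean"))
print(impute(column, "median"))
print(impute(column, "most_frequent"))
```

A distributed version would compute the fill value with DataFrame aggregations rather than in-memory lists, and the median could be approximated with quantile sketches, as the description suggests.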

Issue Links

blocks

SPARK-15040 PySpark impl for ml.feature.Imputer (Resolved)

SPARK-15041 adding mode strategy for ml.feature.Imputer for categorical features (Resolved)

SPARK-19969 Doc and examples for Imputer (Resolved)

relates to

SPARK-13639 Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors (Resolved)

links to

[Github] Pull Request #11601 (hhbyyh)

People

Assignee: yuhao yang

Reporter: Nicholas Pentreath

Shepherd: Nicholas Pentreath

Votes: 1

Watchers: 7

Dates

Created: 29/Feb/16 14:07

Updated: 16/Mar/17 10:51

Resolved: 16/Mar/17 10:51