  1. Spark
  2. SPARK-13568

Create feature transformer to impute missing values

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None

      Description

      It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.

      Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches. Where possible, existing DataFrame code can be used (e.g. for approximate quantiles).
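
To make the three proposed strategies concrete, here is a minimal plain-Python sketch. It is not Spark code and not the implementation linked in the comments; the names `is_missing`, `fit_surrogate` and `impute` are hypothetical, and treating both None and NaN as missing is an assumption for illustration.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(value):
    """Assumption: both None and NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_surrogate(column, strategy="mean"):
    """Compute the replacement value from the observed entries only."""
    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "most_frequent":
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(column, surrogate):
    """Replace every missing entry with the fitted surrogate."""
    return [surrogate if is_missing(v) else v for v in column]


column = [1.0, None, 2.0, float("nan"), 2.0, 7.0]
for strategy in ("mean", "median", "most_frequent"):
    surrogate = fit_surrogate(column, strategy)
    print(strategy, surrogate, impute(column, surrogate))
```

Splitting the work into a fit step and an impute step mirrors the usual estimator/model pattern: the surrogate is computed once and can then be applied to new data. At scale, an exact median would likely be swapped for an approximate quantile, as the description suggests.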

          Activity

          yuhaoyan yuhao yang added a comment -

          Hi Nick, can I work on this since I kind of already have...
          I have an implementation at https://github.com/hhbyyh/spark/blob/imputer/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala

          mlnick Nick Pentreath added a comment -

          Sure, go ahead. However, taking a quick look at your branch, I think the approach needs a bit of discussion.

          I think the Imputer should handle numeric and/or vector columns. For a vector column, the idea is not to impute an entire vector when it is null, but rather to impute the missing (null / NaN) values that may be present in each vector.

          I guess if a vector column itself has missing values (i.e. entire vector is null), then the result would look something like what you have done.

          I tend to think that usage within a pipeline is more likely to be imputing missing values from a set of numeric columns, before applying further transformations into feature vectors. However, we can potentially support all three use cases.
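
The difference between imputing elements inside a vector and imputing a wholly null vector can be sketched in plain Python. This is only an illustration under stated assumptions: vectors are modelled as lists of equal length, the surrogate is the per-position mean, and a null vector is replaced by the vector of per-position means. The function names are hypothetical and are not taken from the linked branch.

```python
import math


def is_missing(value):
    """Assumption: both None and NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_vector_means(vectors):
    """Per-position mean, skipping null vectors and missing elements."""
    present = [v for v in vectors if v is not None]
    size = len(present[0])  # assumes every non-null vector has the same length
    means = []
    for i in range(size):
        observed = [v[i] for v in present if not is_missing(v[i])]
        # A position with no observed values would divide by zero here;
        # a fuller version would need a policy for that case.
        means.append(sum(observed) / len(observed))
    return means


def impute_vectors(vectors, means):
    """Fill missing elements position by position; replace null vectors whole."""
    result = []
    for v in vectors:
        if v is None:
            result.append(list(means))
        else:
            result.append([means[i] if is_missing(x) else x
                           for i, x in enumerate(v)])
    return result


vectors = [[1.0, 10.0], [float("nan"), 20.0], None, [3.0, None]]
means = fit_vector_means(vectors)
print(means)
print(impute_vectors(vectors, means))
```

A plain numeric column is the special case of a one-element vector, which is one reason a single transformer could plausibly cover all three use cases.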

          yuhaoyan yuhao yang added a comment - - edited

          Yes, I'm working on supporting numeric values too.

          And I agree that imputation for a vector column should check the elements in the vector. I intend to support the 3 use cases you mentioned.

          I'll send a PR after some refinement and performance benchmarking. Thanks.

          Update:
          Created a new JIRA to discuss how to handle NaN in Statistics.

          mlnick Nick Pentreath added a comment -

          Ok - the Imputer will need to compute column stats ignoring NaNs, so SPARK-13639 should add that (whether as default behaviour, or an optional argument)
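
Why the statistics need to skip NaNs can be shown with a small plain-Python sketch. The `ignore_nan` flag is hypothetical and stands in for the "optional argument" variant mentioned above; it is not the API proposed in SPARK-13639.

```python
import math


def column_mean(values, ignore_nan=False):
    """Mean of a numeric column; NaN propagates unless ignore_nan is set."""
    if ignore_nan:
        values = [v for v in values if not math.isnan(v)]
    if not values:
        # No usable values: report NaN rather than raising.
        return float("nan")
    return sum(values) / len(values)


column = [2.0, float("nan"), 4.0]
print(column_mean(column))                   # nan: one NaN poisons the sum
print(column_mean(column, ignore_nan=True))  # 3.0: computed from 2.0 and 4.0
```

Without the flag, a single NaN makes the mean itself NaN, so it could not serve as a replacement value.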

          apachespark Apache Spark added a comment -

          User 'hhbyyh' has created a pull request for this issue:
          https://github.com/apache/spark/pull/11601


            People

            • Assignee:
              yuhaoyan yuhao yang
              Reporter:
              mlnick Nick Pentreath
              Shepherd:
              Nick Pentreath
            • Votes:
              1
              Watchers:
              8
