SPARK-4081: Categorical feature indexing

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.4.0
    • Component/s: MLlib
    • Labels: None

    Description

      *Updated Description*

      Decision Trees and tree ensembles require that categorical features be indexed 0, 1, 2, and so on. There is currently no code to help index a dataset. This is a proposal for a helper class that computes these indices and also decides which features to treat as categorical.

      Proposed functionality:

      • Convert a dataset of vectors whose feature types are unknown into a dataset with some continuous features and some categorical features. A maxCategories parameter controls the choice: a feature with at most maxCategories distinct values is treated as categorical, and every other feature is treated as continuous.
      • Map categorical feature values to 0-based indices.
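
      The two steps above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the function names fit_category_maps and index_rows are hypothetical, and assigning indices in sorted order of the original values is an assumption made here to keep the output deterministic.

```python
from typing import Dict, List, Sequence


def fit_category_maps(
    rows: Sequence[Sequence[float]], max_categories: int
) -> Dict[int, Dict[float, int]]:
    """Return {feature index: {original value: 0-based category index}}.

    A feature is treated as categorical when it has at most
    `max_categories` distinct values (hypothetical helper, not a Spark API).
    """
    if not rows:
        return {}
    num_features = len(rows[0])
    # Collect the distinct values seen for each feature.
    distinct = [set() for _ in range(num_features)]
    for row in rows:
        for j, value in enumerate(row):
            distinct[j].add(value)
    category_maps: Dict[int, Dict[float, int]] = {}
    for j, values in enumerate(distinct):
        if len(values) <= max_categories:
            # Assumption: indices follow the sorted order of the values.
            category_maps[j] = {v: i for i, v in enumerate(sorted(values))}
    return category_maps


def index_rows(
    rows: Sequence[Sequence[float]], category_maps: Dict[int, Dict[float, int]]
) -> List[List[float]]:
    """Replace categorical values by their indices; leave the rest unchanged."""
    return [
        [float(category_maps[j][v]) if j in category_maps else v
         for j, v in enumerate(row)]
        for row in rows
    ]


rows = [[-1.0, 0.5], [3.0, 1.7], [7.0, 2.2], [3.0, 9.1]]
category_maps = fit_category_maps(rows, max_categories=3)
indexed = index_rows(rows, category_maps)
```

      With max_categories=3, feature 0 has three distinct values and is indexed, while feature 1 has four and stays continuous.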

      This is implemented in the spark.ml package for the Pipelines API, and it stores the indexes as column metadata.
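
      As a rough picture of what storing the indexes as column metadata could look like, the sketch below builds a plain dictionary describing each feature. The function name build_feature_metadata and the field names ("idx", "type", "values") are assumptions chosen for illustration; they are not Spark's actual metadata schema.

```python
from typing import Any, Dict


def build_feature_metadata(
    num_features: int, category_maps: Dict[int, Dict[float, int]]
) -> Dict[str, Any]:
    """Describe each feature as nominal (categorical) or numeric (continuous).

    Hypothetical stand-in for column metadata; the layout is an assumption.
    """
    attrs = []
    for j in range(num_features):
        if j in category_maps:
            # List original values in the order of their assigned indices,
            # so position i holds the value that was mapped to index i.
            by_index = sorted(category_maps[j].items(), key=lambda kv: kv[1])
            values = [value for value, _ in by_index]
            attrs.append({"idx": j, "type": "nominal", "values": values})
        else:
            attrs.append({"idx": j, "type": "numeric"})
    return {"num_features": num_features, "attrs": attrs}


# Example category map for a two-feature dataset (feature 0 is categorical).
example_maps = {0: {-1.0: 0, 3.0: 1, 7.0: 2}}
metadata = build_feature_metadata(2, example_maps)

# A downstream learner could read the arity of feature 0 from the metadata.
arity_of_feature_0 = len(metadata["attrs"][0]["values"])
```

      Keeping the value list alongside the column lets a later pipeline stage, such as a tree learner, recover both the number of categories and the original values without rescanning the data.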

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 2
              Watchers: 9