Spark / SPARK-22801

Allow FeatureHasher to specify numeric columns to treat as categorical

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: ML
    • Labels: None

    Description

      FeatureHasher, added in SPARK-13964, always treats numeric columns as continuous values and never as categorical features. However, categorical features are commonly represented as numbers or codes (often Int) in data sources.

      To hash these features as categorical, users must first explicitly convert them to strings, which is cumbersome.

      Add a new param, categoricalCols, which specifies the numeric columns that should be treated as categorical features.
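The difference between the two treatments can be sketched in plain Python. This is a simplified model rather than Spark's implementation: `bucket`, `hash_features`, and the `zip_code`/`price` columns are hypothetical names, and `zlib.crc32` stands in for the MurmurHash3 function Spark uses. It assumes the documented FeatureHasher behaviour, where a numeric column is indexed by the hash of its column name and keeps its value, while a categorical column is indexed by the hash of "column_name=value" with an indicator value of 1.0.

```python
import zlib
from collections import defaultdict


def bucket(term: str, num_features: int) -> int:
    # Stand-in hash (assumption): Spark uses MurmurHash3; crc32 is only for illustration.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hash_features(row, num_features=16, categorical_cols=()):
    """Hash one row (a dict of column name -> value) into a sparse vector (a dict)."""
    vec = defaultdict(float)
    for name, value in row.items():
        if value is None:
            continue  # null values contribute nothing
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and name not in categorical_cols:
            # Numeric: index from the column name, value kept as is.
            vec[bucket(name, num_features)] += float(value)
        else:
            # Categorical: index from "column_name=value", indicator value 1.0.
            vec[bucket(f"{name}={value}", num_features)] += 1.0
    return dict(vec)


row = {"zip_code": 94105, "price": 2.5}
print(hash_features(row))                                   # zip_code hashed as a number
print(hash_features(row, categorical_cols=("zip_code",)))   # zip_code hashed as a category
```

In this model, listing a column in `categorical_cols` gives the same result as converting its values to strings beforehand, which is the manual step the new param is meant to remove.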

      Note that while the reverse case is certainly possible (i.e., numeric features that are encoded as strings and that a user would like to treat as numeric), it is probably less common and won't be supported at this time.

            People

              Assignee: Nicholas Pentreath (mlnick)
              Reporter: Nicholas Pentreath (mlnick)
