  1. Spark
  2. SPARK-22801

Allow FeatureHasher to specify numeric columns to treat as categorical

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: ML
    • Labels:
      None

      Description

      FeatureHasher, added in SPARK-13964, always treats numeric-type columns as numbers and never as categorical features. However, categorical features are quite commonly represented in data sources as numbers or codes (often, say, Int).

      To hash these features as categorical, users must first explicitly convert them to strings, which is cumbersome.

      Add a new param, categoricalCols, that specifies the numeric columns that should be treated as categorical features.
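
      The intended difference can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: hash_features and its categorical_cols argument are hypothetical names, the example columns are made up, and zlib.crc32 stands in for the MurmurHash3 function Spark uses, so the indices will not match Spark's output. It relies on FeatureHasher's documented behavior: a numeric feature hashes its column name and keeps its value, while a categorical feature hashes the string "column=value" and contributes an indicator of 1.0.

```python
import zlib


def hash_features(row, num_features=16, categorical_cols=()):
    """Hash one row (a dict of column -> value) into a sparse vector.

    Assumption: zlib.crc32 is a stand-in for Spark's MurmurHash3,
    so the resulting indices differ from what Spark would produce.
    """
    vec = {}
    for col, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric feature: hash the column name, keep the raw value.
            key, weight = col, float(value)
        else:
            # Categorical feature: hash "column=value", use an indicator of 1.0.
            key, weight = f"{col}={value}", 1.0
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


# Hypothetical row: zip_code is an integer code, not a quantity.
row = {"price": 9.5, "zip_code": 94105}

default = hash_features(row)                                    # zip_code hashed as a number
with_param = hash_features(row, categorical_cols={"zip_code"})  # zip_code hashed as a category
```

      In the default case the zip code enters the vector as the magnitude 94105.0; with the column listed as categorical it contributes 1.0 at a position that depends on its value, which is what converting the column to a string would achieve today.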

      Note that the reverse case is certainly possible (i.e. numeric features that are encoded as strings, which a user would like to treat as numeric), but it is probably less common and won't be supported at this time.

              People

              • Assignee:
                mlnick Nick Pentreath
                Reporter:
                mlnick Nick Pentreath