Description
FeatureHasher, added in SPARK-13964, always treats numeric columns as numbers and never as categorical features. However, categorical features are often represented as numbers or codes (typically Int) in data sources.
To hash these features as categorical, users must first explicitly convert them to strings, which is cumbersome.
Add a new param categoricalCols which specifies the numeric columns that should be treated as categorical features.
Note that while the reverse case is certainly possible (i.e. numeric features that are encoded as strings, which a user would like to treat as numeric), it is probably less common and will not be supported at this time.
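The intended behaviour can be illustrated with a small pure-Python sketch. This is not Spark's implementation: the function `hash_features`, its parameter `categorical_cols`, and the use of `zlib.crc32` as the hash are assumptions made for illustration only. It follows the documented FeatureHasher convention that a numeric column contributes its value at the index hashed from the column name, while a categorical column contributes an indicator of 1.0 at the index hashed from "column=value".

```python
import zlib


def hash_features(row, num_features=16, categorical_cols=()):
    """Toy feature hasher (hypothetical helper, not Spark code).

    Returns a sparse vector as a dict mapping index -> value.
    """
    vec = {}
    for col, value in row.items():
        if value is None:
            continue  # nulls contribute nothing in this sketch
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric feature: hash the column name, keep the raw value.
            key, weight = col, float(value)
        else:
            # Categorical feature: hash "column=value", use an indicator of 1.0.
            key, weight = f"{col}={value}", 1.0
        # crc32 is a stand-in hash chosen for determinism in this sketch.
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"user_code": 42, "price": 3.5}

# Default: both columns are hashed as numeric features.
as_numeric = hash_features(row)

# With the proposed param: user_code is hashed as a category.
as_categorical = hash_features(row, categorical_cols=("user_code",))

# Current workaround: convert the code to a string first.
workaround = hash_features({"user_code": str(row["user_code"]), "price": 3.5})
```

In this sketch, listing a column in `categorical_cols` gives the same vector as the string-conversion workaround, which is the equivalence the new param is meant to provide without the extra cast.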
Issue Links
- is related to: SPARK-13964 Feature hashing improvements (Resolved)
- relates to: SPARK-23127 Update FeatureHasher user guide for catCols parameter (Resolved)
- links to