Description
Currently, HashingTF works like CountVectorizer (its scikit-learn equivalent is HashingVectorizer): it takes a sequence of strings and computes term frequencies.
Feature hashing is useful for arbitrary feature values (binary, count, or real-valued), not only term counts. For example, scikit-learn's FeatureHasher accepts a sequence of (feature_name, value) pairs (e.g. a map or a list). In this way, feature hashing can act as both a "one-hot encoder" and a "vector assembler" at the same time.
Investigate adding a more generic feature hasher (that in turn can be used by HashingTF).
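To make the idea concrete, here is a minimal pure-Python sketch of a generic feature hasher, with term-frequency hashing expressed on top of it. It is an illustration only, not the proposed Spark API: the names `hash_features` and `term_frequencies` are hypothetical, CRC32 stands in for whatever hash function a real implementation would use, and treating string values as `name=value` indicators is an assumption modeled on the scikit-learn behavior described above.

```python
import zlib
from collections import defaultdict


def hash_features(pairs, num_features=1 << 18):
    """Hash (feature_name, value) pairs into a sparse vector {index: value}.

    Hypothetical helper for illustration; CRC32 is a stand-in hash function.
    """
    vector = defaultdict(float)
    for name, value in pairs:
        if isinstance(value, str):
            # Categorical value: hash "name=value" and add an indicator of 1.0,
            # which gives one-hot-encoder behavior.
            key, weight = f"{name}={value}", 1.0
        else:
            # Numeric value (binary, count, or real-valued): hash the feature
            # name and keep the value, which gives vector-assembler behavior.
            key, weight = name, float(value)
        index = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vector[index] += weight
    return dict(vector)


def term_frequencies(terms, num_features=1 << 18):
    """HashingTF-style term counts built on the generic hasher."""
    return hash_features(((term, 1.0) for term in terms), num_features)
```

For example, `hash_features({"city": "Paris", "clicks": 3, "price": 9.5}.items())` mixes a categorical and two numeric features in one sparse vector, while `term_frequencies(["a", "b", "a"])` counts each term by treating it as a feature with value 1.0.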
Issue Links
- relates to
  - SPARK-21468 FeatureHasher Python API (Resolved)
  - SPARK-21469 Add doc and example for FeatureHasher (Resolved)
  - SPARK-23127 Update FeatureHasher user guide for catCols parameter (Resolved)
- links to