[SPARK-13969] Extend input format that feature hashing can handle - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: ML, MLlib
Labels: None

Description

Currently, HashingTF works like CountVectorizer, except that it hashes terms to indices instead of building a vocabulary (the scikit-learn equivalent is HashingVectorizer). That is, it operates on a sequence of strings and computes term frequencies.

The use cases for feature hashing extend to arbitrary feature values (binary, count, or real-valued). For example, scikit-learn's FeatureHasher accepts a sequence of (feature_name, value) pairs (e.g. a map or a list of pairs). In this way, feature hashing can act as both a "one-hot encoder" and a "vector assembler" at the same time.
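To make the "one-hot encoder plus vector assembler" idea concrete, here is a minimal pure-Python sketch of hashing (feature_name, value) pairs into a fixed-size sparse vector. It is an illustration only, not Spark's or scikit-learn's implementation: the names `bucket` and `hash_features` are hypothetical, `zlib.crc32` is a stand-in hash chosen for determinism, and treating string and boolean values as categorical is an assumption of this sketch.

```python
import zlib
from typing import Dict, Mapping, Union

Value = Union[str, bool, int, float]


def bucket(key: str, num_features: int) -> int:
    """Map a string key to a vector index (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row: Mapping[str, Value], num_features: int = 16) -> Dict[int, float]:
    """Hash one row of named features into a sparse {index: value} vector."""
    vec: Dict[int, float] = {}
    for name, value in row.items():
        # bool is checked together with str because bool is a subclass of int.
        if isinstance(value, (str, bool)):
            # Categorical (assumed): hash "name=value" with an indicator of 1.0,
            # which behaves like one-hot encoding without a vocabulary.
            key, weight = f"{name}={value}", 1.0
        else:
            # Numeric: hash the feature name and keep the raw value,
            # which behaves like assembling columns into a vector.
            key, weight = name, float(value)
        idx = bucket(key, num_features)
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"city": "NYC", "clicks": 3, "price": 2.5}
sparse = hash_features(row, num_features=1 << 10)
```

In this sketch the string feature contributes an indicator entry while the numeric features contribute their raw values, so a single pass covers both roles. Collisions are resolved by addition, which is a common choice but not the only one.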

Investigate adding a more generic feature hasher (that in turn can be used by HashingTF).
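One way to read "can be used by HashingTF" is that term-frequency hashing is the special case in which every token is a feature with value 1.0. The sketch below illustrates that relationship under stated assumptions: `hash_pairs` and `hashing_tf` are hypothetical names, and `zlib.crc32` is a stand-in for whatever hash function a real implementation would use.

```python
import zlib
from typing import Dict, Iterable, Tuple


def hash_pairs(pairs: Iterable[Tuple[str, float]], num_features: int) -> Dict[int, float]:
    """Generic hasher: sum each value into the slot chosen by hashing its name."""
    vec: Dict[int, float] = {}
    for name, value in pairs:
        idx = zlib.crc32(name.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + value
    return vec


def hashing_tf(tokens: Iterable[str], num_features: int = 32) -> Dict[int, float]:
    """Term frequencies as a special case: each token is a (token, 1.0) pair."""
    return hash_pairs(((tok, 1.0) for tok in tokens), num_features)


tf = hashing_tf(["spark", "ml", "spark"])
```

Because repeated tokens land in the same slot and are summed, the result holds term counts, with any hash collisions merged into a shared slot.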

Issue Links

relates to

SPARK-21468 FeatureHasher Python API (Resolved)

SPARK-21469 Add doc and example for FeatureHasher (Resolved)

SPARK-23127 Update FeatureHasher user guide for catCols parameter (Resolved)

links to

[Github] Pull Request #18513 (MLnick)

People

Assignee: Nicholas Pentreath

Reporter: Nicholas Pentreath

Votes: 3

Watchers: 11

Dates

Created: 17/Mar/16 09:44

Updated: 17/Jan/18 08:55

Resolved: 16/Aug/17 08:55