[SPARK-25412] FeatureHasher would change the value of output feature - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Bug
Affects Version/s: 2.3.1
Fix Version/s: None
Component/s: ML
Labels: None

Description

In the current implementation of FeatureHasher.transform, the vector index is determined by a simple modulo of the hashed value, so it is suggested to use a large integer for the numFeatures parameter.

We found several issues with the current implementation:

The feature name cannot be recovered from its index after the FeatureHasher transform, for example when reading feature importances from a decision tree trained on FeatureHasher output.
When indices collide, which is likely when 'numFeatures' is relatively small, the stored value is replaced by the sum of the old and new values.
To avoid collisions, 'numFeatures' must be set to a large number, but the resulting highly sparse vectors increase the computational cost of model training.
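
The behaviour described above can be reproduced with a minimal pure-Python sketch of the hashing trick. It is not Spark's code: `bucket_index` and `hash_features` are hypothetical helper names, and CRC32 stands in for the MurmurHash3 hash that Spark uses, so the indices will differ from what FeatureHasher produces. Three features are hashed into two buckets, which forces at least one collision.

```python
import zlib
from collections import defaultdict


def bucket_index(feature_name: str, num_features: int) -> int:
    # Assumption: CRC32 is a stand-in for Spark's MurmurHash3.
    # The index is simply the hash modulo the vector size.
    return zlib.crc32(feature_name.encode("utf-8")) % num_features


def hash_features(row: dict, num_features: int):
    vector = defaultdict(float)          # sparse vector: index -> value
    names_by_index = defaultdict(list)   # side table: index -> feature names
    for name, value in row.items():
        idx = bucket_index(name, num_features)
        vector[idx] += value             # colliding features are summed
        names_by_index[idx].append(name)
    return dict(vector), dict(names_by_index)


row = {"age": 35.0, "income": 2.0, "clicks": 7.0}
vector, names_by_index = hash_features(row, num_features=2)
print(vector)          # at most two entries for three input features
print(names_by_index)  # at least one index maps to several names
```

The `names_by_index` side table is one possible way to map an index back to candidate feature names. After a collision it can only return all the names that share a bucket, not a single one.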

People

Assignee: Unassigned
Reporter: Vincent
Votes: 0
Watchers: 2

Dates

Created: 12/Sep/18 05:43
Updated: 13/Sep/18 08:17
Resolved: 13/Sep/18 07:58